SS&C Technologies is a leading financial services and healthcare technology company headquartered in Windsor, Connecticut. They are seeking a highly skilled Site Reliability Engineer to join their Operations team, responsible for ensuring the availability, performance, scalability, and reliability of systems and services.

Responsibilities:

Maintain and improve the uptime, performance, and availability of production systems
Define and track SLIs, SLOs, and SLAs to ensure service reliability and user satisfaction
Implement and manage monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
Participate in on-call rotations and respond to incidents, performing root cause analysis and postmortems
Automate repetitive tasks and processes using scripts, configuration management, and Infrastructure as Code (IaC)
Develop CI/CD pipelines to streamline deployment and operational processes
Analyze system performance and capacity trends to plan for future growth
Collaborate with engineering teams to design systems that scale reliably
Support cloud and/or hybrid infrastructure (AWS, Azure, GCP, VMware, etc.)
Manage system provisioning, configuration, and patching via tools such as Ansible, Terraform, or Puppet
Act as a bridge between development and operations teams, championing DevOps and SRE principles
Contribute to a culture of continuous improvement, reliability, and accountability

Requirements:

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
3+ years of experience in a Site Reliability, DevOps, or Systems Engineering role
Experience administering Linux/Unix and Windows systems, including shell scripting
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with cloud platforms (AWS, Azure, or GCP)
Strong knowledge of networking, security, load balancing, and DNS
Experience with monitoring/logging tools (e.g., Prometheus, Grafana, ELK, Splunk, Datadog)
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
Familiarity with ITIL processes and incident, change, and problem management frameworks
Exposure to compliance and security standards (e.g., ISO 27001, SOC 2, HIPAA)
Experience with large-scale distributed systems and microservices architectures

Site Reliability Engineer (SRE) – Operations
