Symmetrio is a rapidly growing healthcare technology organization. The company is seeking a Principal Site Reliability Engineer (SRE) to ensure the reliability, scalability, security, and performance of a mission-critical SaaS platform supporting healthcare providers across the United States.

Responsibilities:

Serve as the primary technical owner for production reliability across U.S. customer environments
Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations
Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact
Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience
Partner with software engineering and platform teams to identify recurring reliability risks and implement sustainable solutions
Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths
Support customer onboarding initiatives by troubleshooting connectivity challenges and ensuring consistent implementation processes
Enhance platform observability through improvements in monitoring, logging, alerting, tracing, and operational dashboards
Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety and operational consistency
Develop operational tooling that supports incident response, troubleshooting, onboarding, and system monitoring activities
Collaborate with engineering leadership to improve cloud architecture, scalability, security, and operational readiness
Partner with customer-facing teams to communicate technical issues, remediation plans, and reliability improvements clearly and effectively
Support compliance, security, and risk management initiatives within highly regulated healthcare environments

Requirements:

6+ years of hands-on experience supporting and managing AWS-based production environments
4+ years of experience supporting web applications and backend services (Python/Django experience strongly preferred)
Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups
Strong experience with Terraform and infrastructure-as-code deployment practices
Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies
Experience building and supporting CI/CD pipelines and release automation processes
Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools
Experience leading production incident response, outage management, and root cause analysis
Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts
Healthcare technology, healthcare SaaS, clinical software, or other regulated industry experience
Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field
