CardioOne partners with independent cardiologists to provide innovative solutions that improve patient outcomes and reduce costs. The company is seeking a highly skilled Site Reliability Engineer to ensure the reliability, scalability, security, and performance of its production systems and services.

Responsibilities:

Ensure high availability, scalability, and performance of production systems
Implement and maintain SLIs, SLOs, and SLAs for critical services
Conduct capacity planning and performance tuning
Automate infrastructure provisioning using IaC tools such as Terraform, Terragrunt, and Ansible
Develop automation to minimize manual operations and improve deployment workflows
Build CI/CD pipelines to support rapid and reliable deployments
Design and maintain monitoring, logging, and alerting systems (Datadog)
Participate in on-call rotations and lead incident response efforts
Perform root-cause analysis and develop postmortems to prevent recurring issues
Manage cloud infrastructure (AWS, Azure) and container orchestration platforms (Kubernetes, ECS)
Optimize system architecture for reliability and fault tolerance
Implement best practices for security, networking, and service resilience
Work closely with development teams to design reliable microservices and distributed systems
Advocate for SRE principles and drive operational excellence across engineering teams
Mentor engineers on reliability practices, tooling, and automation strategies

Requirements:

Bachelor's degree in Computer Science, Engineering, or equivalent experience
3–7 years of experience in SRE, DevOps, or Systems Engineering roles
Strong proficiency with Linux systems and shell scripting
Experience with cloud platforms (AWS, Azure)
Hands-on experience with Kubernetes/ECS and container technologies (Docker)
Proficiency in at least one programming language: Python or Java
Experience with CI/CD pipelines and DevOps tooling
Strong understanding of distributed systems, networking, and security fundamentals
Experience with observability stacks (OpenTelemetry)
Knowledge of database management (PostgreSQL)
Experience with configuration management tools (Ansible, Chef, Puppet)
Familiarity with zero-downtime deployments and chaos engineering practices
