CardioOne partners with independent cardiologists to provide innovative solutions that improve patient outcomes and reduce costs. They are seeking a highly skilled Site Reliability Engineer to ensure the reliability, scalability, security, and performance of their production systems and services.
Responsibilities:
- Ensure high availability, scalability, and performance of production systems
- Implement and maintain SLIs, SLOs, and SLAs for critical services
- Conduct capacity planning and performance tuning
- Automate infrastructure provisioning using IaC tools such as Terraform, Terragrunt, and Ansible
- Develop automation to minimize manual operations and improve deployment workflows
- Build CI/CD pipelines to support rapid and reliable deployments
- Design and maintain monitoring, logging, and alerting systems (Datadog)
- Participate in on-call rotations and lead incident response efforts
- Perform root-cause analysis and write postmortems to prevent recurring issues
- Manage cloud infrastructure (AWS, Azure) and container orchestration platforms (Kubernetes, ECS)
- Optimize system architecture for reliability and fault tolerance
- Implement best practices for security, networking, and service resilience
- Work closely with development teams to design reliable microservices and distributed systems
- Advocate for SRE principles and drive operational excellence across engineering teams
- Mentor engineers on reliability practices, tooling, and automation strategies
Requirements:
- Bachelor's degree in Computer Science, Engineering, or equivalent experience
- 3–7 years of experience in SRE, DevOps, or Systems Engineering roles
- Strong proficiency with Linux systems and shell scripting
- Experience with cloud platforms (AWS, Azure)
- Hands-on experience with Kubernetes/ECS and container technologies (Docker)
- Proficiency in at least one programming language (Python or Java)
- Experience with CI/CD pipelines and DevOps tooling
- Strong understanding of distributed systems, networking, and security fundamentals
- Experience with observability stacks (OpenTelemetry)
- Knowledge of database management (PostgreSQL)
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Familiarity with zero-downtime deployments and chaos engineering practices