Cleerly is a healthcare company revolutionizing heart disease diagnosis and treatment. They are seeking a highly skilled Site Reliability Engineer to ensure the health and integrity of their systems, focusing on cloud infrastructure and system reliability within AWS.
Responsibilities:
- Cloud Environment Buildout: Stand up and harden the new Hub cloud environment and deployment pipeline, ensuring reliability, security, and repeatability
- Infrastructure Management: Design, develop, and manage cloud infrastructure using AWS services, Terraform (Infrastructure as Code), and Docker containers
- System Integrity: Use strong system administration and network engineering skills to ensure the reliability, scalability, and performance of all platform systems
- Own Observability & Incidents: Own observability and incident readiness end-to-end, including third-party connectivity patterns, runtime guardrails, and upgrade strategies (canary/rollback), so the platform can scale safely as new AI integrations are added
- Drive DevOps Automation: Implement DevOps methodologies and tools, facilitating Continuous Integration (CI), Continuous Delivery (CD), and the automation of infrastructure management tasks
- Reduce Toil: Develop and maintain automation tools to proactively reduce manual operational tasks (toil)
- Security Maintenance: Maintain system and network security by implementing and enforcing appropriate security measures across the platform
Requirements:
- 6–10+ years of professional experience running and managing production services on AWS
- Deep understanding of core AWS fundamentals, including VPC networking, IAM, KMS, security groups, and routing
- Expertise with Infrastructure-as-Code (Terraform, CDK, or CloudFormation) and reliable environment replication
- Experience operating and managing container platforms (EKS/ECS) and/or scalable managed services
- Proven ability to design and automate comprehensive CI/CD pipelines (builds, tests, deploys, and rollbacks)
- Deep knowledge of metrics, logs, and traces, along with experience setting SLOs, configuring robust alerting, and managing structured incident response processes
- Practical High Availability (HA) / Disaster Recovery (DR) thinking, including backup strategies, multi-AZ patterns, and failure drills
- Strong security-by-default posture, including expertise in secrets handling, key rotation, and the principle of least privilege
- Acute performance and cost awareness, including effective use of tagging, budgeting, right-sizing, and autoscaling
- Proven ability to partner with engineering and security teams to achieve rapid deployment goals without compromising system reliability
- Expertise in the Software Development Life Cycle (SDLC) specifically for Software as a Medical Device (SaMD)
- Deep experience operating in regulated environments, managing audit logs, strict change control, and comprehensive evidence collection
- Working knowledge of essential medical imaging and healthcare data-exchange standards, including DICOM and HL7
- Proven experience developing comprehensive cybersecurity measures and implementing robust data protection and privacy controls across cloud infrastructure
- Experience designing and implementing secure connectivity patterns for healthcare customers, including PrivateLink, VPN, and Direct Connect
- Expertise in container supply-chain security, including SBOM (Software Bill of Materials), signing, scanning, and runtime policy enforcement
- AWS Certified SysOps Administrator – Associate, or an AWS Professional-level certification
- Certified Kubernetes Administrator (CKA)
- Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience
- Proven experience in Site Reliability Engineering, DevOps, or a similar role