Sonio is a mission-driven company focused on improving women's and children's health through technological innovation. As the first Site Reliability Engineer in the US, you will be responsible for the platform’s stability, incident response, and operational knowledge transfer, ensuring a secure and resilient production environment.
Responsibilities:
- Own US coverage for releases and incidents as the first responder during Pacific Time hours
- Bridge infrastructure and code: work hand-in-hand with our DevOps team on Kubernetes, Terraform, and AWS, and read and patch Elixir code yourself so you are not blocked waiting for a backend engineer
- Drive incident response end-to-end, managing triage, mitigation, and blameless post-mortems with real follow-through
- Improve the platform’s operability by defining SLOs, tuning alerts to reduce toil, and pushing observability (metrics, logs, tracing) where it’s lacking
- Transfer operational knowledge from France to the US by authoring runbooks and documenting procedures so local teams are empowered to act when something breaks
- Support compliance and security in our regulated medical-device environment, maintaining HIPAA-aligned controls and an audit-ready infrastructure
Requirements:
- 4+ years of experience in SRE, DevOps, or Production Engineering, including significant on-call experience on a 24/7 product
- You have a hybrid, code-literate mindset: you are an infrastructure expert who can also navigate a backend codebase to triage and patch issues independently
- You bring strong technical foundations in Kubernetes, Terraform, and AWS, along with the ability to design and tune your own observability signals
- You are highly autonomous and comfortable making technical decisions with limited supervision, which is essential given the timezone difference with France
- You maintain operational rigor and stay calm under pressure, with the written English skills necessary to produce high-quality runbooks and handle async handoffs