The ReWork Group partners with high-growth startups to build the future. The group is seeking a Senior Site Reliability Engineer to build reliable infrastructure and proactive monitoring for a client that is scaling from thousands to millions of users.
Responsibilities:
- Lead incident response and establish sustainable on-call practices, including comprehensive runbooks, blameless postmortems, and systematic improvements that reduce MTTR
- Develop and maintain self-service observability solutions using modern monitoring tools that provide actionable insights for troubleshooting and performance optimization
- Create and maintain infrastructure as code (using Terraform, CloudFormation) that allows for consistent, scalable, and secure cloud environments on AWS
- Partner closely with feature teams to architect resilient infrastructure for critical components (databases, networking, async workflows, data pipelines) that scale seamlessly
- Work closely with DevX to design and implement robust CI/CD pipelines with advanced deployment strategies (blue/green, canary) that enable teams to ship confidently and rapidly
- Advocate for best practices early in feature design, so that services are built with reliability in mind and stay maintainable as they grow
Requirements:
- 5+ years in SRE or DevOps — or 7+ years in software engineering with a serious infrastructure focus and the scars to prove it
- You've led incident response for high-availability production systems — you run tight RCAs, you drive blameless postmortems, and you leave every incident with a team that's smarter than before
- You've designed highly available deployment architectures across multiple targets — EC2, Fargate, and beyond — with real expertise in auto-scaling, health checks, and graceful degradation when things get hard
- You've implemented monitoring and observability solutions that actually get used — Datadog, Prometheus, ELK, or comparable — and you've made the case internally for why observability isn't optional
- Deep AWS fluency and a strong infrastructure-as-code practice — Terraform is your default, not your fallback
- You've built and improved CI/CD pipelines that give engineering teams the confidence to ship fast and reliably