Coalition is the world's first Active Insurance provider designed to help prevent digital risk before it strikes. They are seeking a Senior Site Reliability Engineer to build and operate infrastructure and tools that empower developers to deliver scalable and reliable software.

Responsibilities:

Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform, driving architectural decisions that improve long-term maintainability and reliability
System Reliability: Lead efforts to improve platform resilience through failure-based testing, automated recovery strategies, and proactive capacity planning
Developer Enablement: Own the design and delivery of reusable platform components and self-service tools that streamline the developer experience and reduce cross-team toil
Observability: Define and evolve observability standards across the platform, including system metrics, distributed tracing, and SLO frameworks
Project Ownership: Own projects end to end, from initial scoping and effort estimation through detailed planning, execution, and successful rollout
Mentorship & Standards: Mentor engineers across the team, uphold high infrastructure quality, and actively shape the best practices and standards used by the organization
Collaboration: Engage in technical design discussions, providing guidance and feedback while adapting strategies based on team input and evolving requirements

Requirements:

6+ years of experience in SRE, DevOps, Cloud Engineering, or Software Development roles
Hands-on experience operating production environments in AWS
Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
Strong experience with Terraform
Experience with container orchestration platforms like ECS or Kubernetes
Familiarity with CI/CD tools such as GitHub Actions
Experience designing and implementing reusable platform components based on team requirements
Solid understanding of observability practices including system metrics, distributed tracing, and SLOs
Exposure to failure-based testing approaches and automated recovery strategies
Strong leadership and communication skills, both written and verbal
Experience evangelizing reliability best practices
Experience with microservices architectures
Exposure to Kafka or other event streaming systems
Experience building internal developer platforms or self-service infrastructure
Familiarity with systems security, compliance requirements, or hardening practices
