Coalition is the world's first Active Insurance provider designed to help prevent digital risk before it strikes. They are seeking a Senior Site Reliability Engineer to build and operate infrastructure and tools that empower developers to deliver scalable and reliable software.
Responsibilities:
- Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform, driving architectural decisions that improve long-term maintainability and reliability
- System Reliability: Lead efforts to improve platform resilience through failure-based testing, automated recovery strategies, and proactive capacity planning
- Developer Enablement: Own the design and delivery of reusable platform components and self-service tools that streamline the developer experience and reduce cross-team toil
- Observability: Define and evolve observability standards across the platform, including system metrics, distributed tracing, and SLO frameworks
- Project Ownership: Own projects end to end—from initial scoping and effort estimation through detailed planning, execution, and successful rollout
- Mentorship & Standards: Mentor engineers across the team, uphold high infrastructure quality, and actively shape the best practices and standards used by the organization
- Collaboration: Engage in technical design discussions, providing guidance and feedback while adapting strategies based on team input and evolving requirements
Requirements:
- 6+ years of experience in SRE, DevOps, Cloud Engineering, or Software Development roles
- Hands-on experience operating production environments in AWS
- Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
- Strong experience with Terraform
- Experience with container orchestration platforms like ECS or Kubernetes
- Familiarity with CI/CD tools such as GitHub Actions
- Experience designing and implementing reusable platform components based on team requirements
- Solid understanding of observability practices including system metrics, distributed tracing, and SLOs
- Exposure to failure-based testing approaches and automated recovery strategies
- Strong leadership and communication skills, both written and verbal
- Experience evangelizing reliability best practices
- Experience with microservices architectures
- Exposure to Kafka or other event streaming systems
- Experience building internal developer platforms or self-service infrastructure
- Familiarity with systems security, compliance requirements, or hardening practices