Coalition, Inc. is the world's first Active Insurance provider designed to help prevent digital risk before it strikes. We are seeking a Site Reliability Engineer to build and operate the infrastructure and tools that empower developers to deliver scalable and reliable software.
Responsibilities:
- Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform
- System Reliability: Improve the resilience and operability of our platform through failure-based testing and automated recovery strategies
- Developer Enablement: Design and implement reusable platform components and self-service tools to streamline the developer experience
- Observability: Implement and maintain robust observability practices, including system metrics, distributed tracing, and SLO management
- Mentorship & Standards: Guide junior engineers, uphold high infrastructure quality, and contribute to the team’s evolving best practices
- Collaboration: Participate in technical design discussions, sharing feedback and adapting strategies based on team input and evolving requirements
Requirements:
- 4+ years of experience in SRE, DevOps, Cloud Engineering, or Software Development roles
- Hands-on experience operating and scaling production environments within AWS
- Strong expertise with Terraform for managing complex cloud infrastructure
- Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
- Experience with ECS or Kubernetes
- Familiarity with modern deployment tools, specifically GitHub Actions
- Strong written and verbal communication skills, with a knack for evangelizing reliability best practices across the organization
- Experience troubleshooting complex distributed systems in a high-traffic production environment
- Exposure to event streaming systems such as Kafka or Kinesis
- Experience contributing to Internal Developer Platforms (IDP) or automating self-service infrastructure workflows
- Familiarity with systems security, compliance requirements, or infrastructure hardening