Coalition, Inc. is the world's first Active Insurance provider focused on preventing digital risks. The company is seeking a Site Reliability Engineer II to join its Platform SRE team and to build and operate the infrastructure, tools, and processes that enable developers to deliver reliable software.
Responsibilities:
- Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform
- System Reliability: Improve the resilience and operability of the platform through failure-based testing and automated recovery strategies
- Developer Enablement: Design and implement reusable platform components and self-service tools to streamline the developer experience
- Observability: Implement and maintain robust observability practices, including system metrics, distributed tracing, and SLO management
- Collaboration: Participate in technical design discussions, sharing feedback and adapting strategies based on team input and evolving requirements
- Standards & Best Practices: Uphold high infrastructure quality and actively contribute to the team's evolving best practices and standards
- On-Call: Participate in a low-volume on-call rotation
Requirements:
- 4+ years of experience in SRE, DevOps, Cloud Engineering, or Software Development roles
- Hands-on experience operating and scaling production environments within AWS
- Strong expertise with Terraform for managing complex cloud infrastructure
- Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
- Experience with ECS or Kubernetes
- Familiarity with modern deployment tools, specifically GitHub Actions
- Strong written and verbal communication skills, with a knack for evangelizing reliability best practices across the organization
- Experience troubleshooting complex distributed systems in a high-traffic production environment
- Exposure to event streaming systems such as Kafka or Kinesis
- Experience contributing to Internal Developer Platforms (IDP) or automating self-service infrastructure workflows
- Familiarity with systems security, compliance requirements, or infrastructure hardening