Customer.io is an automated communication platform used by over 8,000 companies. The company is seeking a Site Reliability Engineer to scale its infrastructure, reduce operational toil, and improve reliability as it grows.
Responsibilities:
- Build and scale infrastructure to support billions of messages per day and real-time events
- Automate deployments, alerting, and incident response
- Make our on-call better with clear alerts, solid documentation, and faster resolution
- Tune the performance of MySQL and other datastores, and improve reliability across distributed systems
- Collaborate across teams to debug, ship, and support systems in production
- Share knowledge and raise the bar by posting your progress publicly through short videos, thoughtful writing, and mentorship
- Leverage AI tools to prototype, move faster, and make better decisions
Requirements:
- 7+ years in SRE or infrastructure roles, improving production systems at scale
- Deep MySQL experience, including schema design, performance tuning, and operational tooling
- Fluency in cloud-native tech (GCP a plus) and Terraform
- Proficiency in Go and Bash for scripting and systems programming
- Skill in observability, incident response, and debugging distributed systems
- A preference for action over perfection, and pride in owning technical decisions