Design, implement, maintain, and monitor reliable production systems at scale.
Lead incident response, mitigate production issues, and conduct postmortem analyses.
Proactively monitor performance, analyze system failures, identify bottlenecks, and propose solutions.
Create and support observability/monitoring tools and vendor integrations.
Drive the growth of a reliability culture by promoting cross-functional collaboration to improve system reliability, scalability, resilience, and security.
Train and mentor other engineers.
Requirements
5+ years of experience as a reliability-focused engineer in a fast-paced, rapidly growing enterprise environment.
Deep understanding of tooling and application development in these areas:
Cloud computing platforms such as AWS, Azure, or GCP.
Infrastructure-as-code tools such as Terraform or Crossplane.
Developing applications in languages such as Python, Ruby, or Go.
Deploying and supporting applications in Kubernetes at scale.
Implementing monitoring in tools like Grafana, New Relic, or Datadog.
Experience debugging live, critical production issues.
Familiarity with reliability principles, such as resilient system design, application and supply chain security, and SLO governance.
Ability to work cross-functionally with diverse engineering teams.