Design, implement, maintain, and monitor reliable production systems at scale.
Lead incident response, mitigate production issues, and conduct postmortem analysis.
Proactively monitor performance, analyze system failures, identify bottlenecks, and propose solutions.
Create and support observability/monitoring tools and vendor integrations.
Drive the growth of a reliability culture, promoting cross-functional collaboration towards improving system reliability, scalability, resilience, and security.
Train and mentor other engineers.

5+ years of experience as a reliability-focused engineer in a fast-paced, rapidly growing enterprise environment.
Deep understanding of tooling and application development in these areas:
Cloud computing platforms such as AWS, Azure, or GCP.
Infrastructure-as-code tools such as Terraform or Crossplane.
Developing applications in languages such as Python, Ruby, or Go.
Deploying and supporting applications in Kubernetes at scale.
Implementing monitoring in tools like Grafana, New Relic, or Datadog.
Experience debugging live, critical production issues.
Familiarity with reliability principles, such as resilient systems, application and supply chain security, and SLO governance.
Ability to work cross-functionally with diverse engineering teams.

Company-subsidized medical, dental, & vision plans
401(k) plan with company match
Annual bonus
Flexible PTO to encourage a healthy work/life balance (2 weeks STRONGLY encouraged!)
Generous paid leave programs, including 16-week paid parental leave and disability benefits
Workplace flexibility and modern work schedules focused on getting the job done, not hours clocked
Company-wide in-person events and team outings
Lifestyle enhancement program
Company equipment provided (Windows & Mac options)
Annual performance reviews with opportunities for growth and career development

Senior Site Reliability Engineer, SRE
