Thrive Market is an online membership-based market focused on making healthy and sustainable living accessible to everyone. They are seeking a Staff Site Reliability Engineer to establish their SRE practice, define reliability metrics, and enhance their platform's reliability during rapid growth. The role involves hands-on engineering work as well as strategic planning to ensure systems scale effectively.
Responsibilities:
- Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
- Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
- Establish error budgets and use them to balance feature velocity with reliability investments
- Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
- Design and implement chaos engineering practices to proactively identify failure modes before they impact members
- Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
- Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
- Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
- Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
- Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
- Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
- Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
- Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
- Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
- Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
- Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement
Requirements:
- B.S. in Computer Science or equivalent professional experience
- 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
- Deep expertise in Kubernetes (K8s), including cluster management, Helm charts, service meshes, and production-grade container orchestration
- Strong systems engineering background with advanced proficiency in Linux administration
- Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
- Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
- Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
- Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
- Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
- Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
- Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
- Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
- Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
- Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
- Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
- Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
- Experience building and managing cost-optimization strategies for cloud infrastructure
- Background in establishing SRE practices in organizations transitioning from traditional DevOps models
- Experience with configuration management tools (Ansible, Chef, Puppet, or similar)