Gifthealth is revolutionizing the way people experience healthcare by simplifying the process of managing prescriptions and health services. They are seeking a Lead Site Reliability Engineer to build reliable, scalable software systems and support DevOps practices that enhance application performance and resilience.
Responsibilities:
- Designs, builds, and maintains reliable, scalable software systems supporting Ruby on Rails applications
- Embeds reliability, performance, and operational best practices into application code and development workflows
- Owns DevOps practices including CI/CD reliability, deployment strategies, and release safety
- Leads incident response, debugging, and root cause analysis across application and platform layers
- Implements and evolves observability (logging, metrics, tracing) within application and service code
- Partners with engineering teams on architecture, capacity planning, and technical standards
Requirements:
- Bachelor's degree in computer science, engineering, or related field OR equivalent professional experience in software engineering, SRE, or DevOps roles
- 5+ years of experience in software engineering, SRE, or DevOps roles
- Hands-on experience building and operating Ruby on Rails applications in production
- Experience owning production incidents and application-level reliability
- Knowledge of Ruby on Rails application architecture and production operations; software reliability engineering principles (SLOs, SLIs, error budgets); and modern DevOps and CI/CD practices
- Strong software engineering skills (Ruby and/or comparable backend languages)
- Skills in debugging and performance optimization of production applications
- Skills in CI/CD pipelines, deployment automation, and release tooling
- Skills in monitoring and observability tooling (Datadog, New Relic, Prometheus, etc.)
- Ability to write production-quality code that improves system reliability
- Ability to collaborate with product and engineering teams to influence design decisions
- Ability to troubleshoot complex, cross-system failures
- Cloud platform certifications (AWS, GCP, Azure)
- SRE or DevOps-focused certifications
- Experience in high-growth or scaling engineering organizations
- Experience working in regulated or customer-impact–sensitive environments
- Knowledge of security and compliance considerations in production systems
- Skills in Infrastructure as Code (Terraform or similar)
- Skills in containerization and orchestration (Docker)
- Ability to mentor engineers on operational ownership and reliability practices
- Ability to balance speed of delivery with long-term system health