Teladoc Health empowers individuals to live their healthiest lives through virtual care. The company is seeking a Senior Site Reliability Engineer to own reliability, automation, and infrastructure-as-code for its modern Data & AI platform, with a focus on maintaining an Azure-based data ecosystem.
Responsibilities:
- Build and maintain Terraform modules for data platform services (Snowflake, Airbyte, Astronomer, dbt, Kafka)
- Develop IaC standards, GitOps workflows, and automated CI/CD pipelines using GitHub Actions
- Migrate manual configurations to fully codified infrastructure and enable self‑service provisioning for engineers
- Implement monitoring, alerting, and SLOs/SLIs for data pipelines and platform components
- Lead incident response, root cause analysis, and postmortems
- Create automation, runbooks, and self‑healing capabilities to reduce MTTR
- Design secure connectivity patterns between Azure and AWS vendor systems
- Troubleshoot networking, VPN, private endpoints, DNS, and MFT integrations
- Build CI/CD pipelines using GitHub Actions for infrastructure changes with comprehensive testing (terraform plan, validate, compliance checks)
- Implement policy-as-code using tools like Sentinel, OPA, or Azure Policy integrated into GitHub workflows
- Develop testing frameworks for infrastructure code (Terratest, kitchen-terraform) with automated execution in GitHub Actions
- Improve abstractions and tooling to streamline development workflows
- Optimize Snowflake compute usage and Airflow/dbt performance
- Apply cloud cost management practices and tagging strategies
- Support capacity planning and forecasting
- Lead complex troubleshooting efforts across distributed systems spanning multiple cloud providers
- Debug integration issues with Kafka streams, CDC patterns, and real-time data pipelines
- Resolve platform-wide incidents involving Snowflake, Astronomer, Airbyte, and downstream BI tools (Power BI, Tableau, Cube Cloud)
- Partner with vendors for escalated support cases and coordinate resolution across multiple teams
Requirements:
- 7+ years in Site Reliability Engineering, DevOps, or Platform Engineering roles
- 5+ years of production experience with Terraform at scale
- Strong Azure expertise; AWS experience beneficial
- Experience operating cloud-based data platforms (Snowflake, Airflow, etc.)
- Expert GitHub knowledge (pull requests, Actions, branching strategies)
- Strong troubleshooting skills across distributed systems, networking, and data pipelines
- Proficient in Python, Bash, and PowerShell; able to read SQL and YAML/JSON
- Strong experience with containerization and orchestration (Docker, Kubernetes)
- Healthcare data experience (FHIR, HL7, claims data)
- Experience with Kafka, dbt administration, and BI tools (Power BI/Tableau)
- Experience with data quality frameworks and synthetic data generation
- Experience with policy-as-code tools (Sentinel, OPA, Checkov)