Teladoc Health empowers individuals to live their healthiest lives through virtual care. The company is seeking a Senior Site Reliability Engineer to own reliability, automation, and infrastructure-as-code for its modern Data & AI platform, with a focus on maintaining an Azure-based data ecosystem.

Responsibilities:

Build and maintain Terraform modules for data platform services (Snowflake, Airbyte, Astronomer, dbt, Kafka)
Develop IaC standards, GitOps workflows, and automated CI/CD pipelines using GitHub Actions
Migrate manual configurations to fully codified infrastructure and enable self‑service provisioning for engineers
Implement monitoring, alerting, and SLOs/SLIs for data pipelines and platform components
Lead incident response, root cause analysis, and postmortems
Create automation, runbooks, and self‑healing capabilities to reduce MTTR
Design secure connectivity patterns between Azure and AWS vendor systems
Troubleshoot networking, VPN, private endpoints, DNS, and MFT integrations
Build CI/CD pipelines using GitHub Actions for infrastructure changes with comprehensive testing (terraform plan, validate, compliance checks)
Implement policy-as-code using tools like Sentinel, OPA, or Azure Policy integrated into GitHub workflows
Develop testing frameworks for infrastructure code (Terratest, kitchen-terraform) with automated execution in GitHub Actions
Improve abstractions and tooling to streamline development workflows
Optimize Snowflake compute usage and Airflow/dbt performance
Apply cloud cost management practices and tagging strategies
Support capacity planning and forecasting
Lead complex troubleshooting efforts across distributed systems spanning multiple cloud providers
Debug integration issues with Kafka streams, CDC patterns, and real-time data pipelines
Resolve platform-wide incidents involving Snowflake, Astronomer, Airbyte, and downstream BI tools (Power BI, Tableau, Cube Cloud)
Partner with vendors for escalated support cases and coordinate resolution across multiple teams

Requirements:

7+ years in Site Reliability Engineering, DevOps, or Platform Engineering roles
5+ years production experience with Terraform at scale
Strong Azure expertise; AWS experience beneficial
Experience operating cloud-based data platforms (Snowflake, Airflow, etc.)
Expert GitHub knowledge (pull requests, Actions, branching strategies)
Strong troubleshooting skills across distributed systems, networking, and data pipelines
Proficient in Python, Bash, and PowerShell; able to read SQL and YAML/JSON
Strong experience with containerization and orchestration (Docker, Kubernetes)
Healthcare data experience (FHIR, HL7, claims data)
Experience with Kafka, dbt administration, and BI tools (Power BI/Tableau)
Experience with data quality frameworks and synthetic data generation
Policy-as-code tools (Sentinel, OPA, Checkov)
