Amwell is transforming healthcare through technology and innovation. The Site Reliability Engineer is responsible for building and operating shared infrastructure and for ensuring that it is predictable, resilient, and supportive of developer velocity.

Responsibilities:

Implement cloud infrastructure in AWS using approved patterns and guardrails
Support EKS-based runtime foundations, including cluster add-ons and shared services
Build environment parity across nonprod and prod, and flag any required divergence early, with evidence
Help make cloud primitives predictable, supportable, and easy to consume
Develop and maintain reusable platform modules and templates using Terraform or CDKTF where applicable
Contribute to baseline building blocks: VPC patterns, IAM primitives, EKS base clusters, ingress patterns, secrets, and shared data stores as assigned
Keep modules consumable through sane defaults, versioning, changelogs, and upgrade guidance
Reduce drift by enforcing standards through code, not documentation alone
Improve CI workflows for infrastructure changes: plan and apply safety, policy checks, drift detection, and promotion across environments
Remove manual steps from provisioning and onboarding by turning them into pipelines and documented runbooks
Support internal module consumption patterns, including examples and reference implementations
Favor repeatability and clarity over clever one-off solutions
Operate platform-owned services with an ownership mindset; ownership is not optional
Participate in the on-call rotation for platform services and follow incident procedures
Write and maintain runbooks, dashboards, and alerts for what you ship
Drive post-incident follow-ups that reduce repeat failures
Implement least-privilege IAM patterns and secure-by-design defaults
Partner with Security to integrate controls into pipelines and platform defaults
Treat auditability as a feature: logs, approvals, traceability, and evidence
Follow established governance and exception processes, and document any deviations

Requirements:

3+ years of experience in platform engineering, DevOps, SRE, or infrastructure engineering
Working experience with AWS and infrastructure as code (Terraform preferred, CDKTF acceptable)
Practical Kubernetes experience, preferably EKS (deploying, operating, debugging)
Comfort with networking fundamentals: DNS, TLS, routing, load balancers, and security groups
Ability to debug pipelines and distributed failures without guessing
Strong written communication: design notes, runbooks, and crisp status updates
