Amwell is transforming healthcare through technology and innovation, aiming to provide trusted solutions for the industry's biggest challenges. The Site Reliability Engineer will build and operate shared infrastructure that is predictable and resilient, with direct impact on developer velocity and operational safety.
Responsibilities:
- Implement cloud infrastructure in AWS using approved patterns and guardrails
- Support EKS-based runtime foundations, including cluster add-ons and shared services
- Build environment parity across nonprod and prod, and flag any required divergence early, with evidence
- Help make cloud primitives predictable, supportable, and easy to consume
- Develop and maintain reusable platform modules and templates using Terraform or CDKTF where applicable
- Contribute to baseline building blocks: VPC patterns, IAM primitives, EKS base clusters, ingress patterns, secrets, and shared data stores as assigned
- Keep modules consumable through sane defaults, versioning, changelogs, and upgrade guidance
- Reduce drift by enforcing standards through code, not documentation alone
- Improve CI workflows for infrastructure changes: plan and apply safety, policy checks, drift detection, and promotion across environments
- Remove manual steps from provisioning and onboarding by turning them into pipelines and documented runbooks
- Support internal module consumption patterns, including examples and reference implementations
- Favor repeatability and clarity over clever one-off solutions
- Operate platform-owned services with an ownership mindset; ownership is not optional
- Participate in the on-call rotation for platform services and follow incident procedures
- Write and maintain runbooks, dashboards, and alerts for what you ship
- Drive post-incident follow-ups that reduce repeat failures
- Implement least-privilege IAM patterns and secure-by-design defaults
- Partner with Security to integrate controls into pipelines and platform defaults
- Treat auditability as a feature: logs, approvals, traceability, and evidence
- Follow established governance and exception processes and document deviations
Requirements:
- 3+ years of experience in platform engineering, DevOps, SRE, or infrastructure engineering
- Working experience with AWS and infrastructure as code (Terraform preferred, CDKTF acceptable)
- Practical Kubernetes experience, preferably EKS (deploying, operating, debugging)
- Comfort with networking fundamentals: DNS, TLS, routing, load balancers, and security groups
- Ability to debug pipelines and distributed failures without guessing
- Strong written communication: design notes, runbooks, and crisp status updates