Amwell is transforming healthcare through technology and innovation. The Site Reliability Engineer is responsible for building and operating shared infrastructure, ensuring that it is predictable, resilient, and supportive of developer velocity.
Responsibilities:
- Implement cloud infrastructure in AWS using approved patterns and guardrails
- Support EKS-based runtime foundations, including cluster add-ons and shared services
- Build environment parity across nonprod and prod, and flag any required divergence early, with evidence
- Help make cloud primitives predictable, supportable, and easy to consume
- Develop and maintain reusable platform modules and templates using Terraform or CDKTF where applicable
- Contribute to baseline building blocks: VPC patterns, IAM primitives, EKS base clusters, ingress patterns, secrets, and shared data stores as assigned
- Keep modules consumable through sane defaults, versioning, changelogs, and upgrade guidance
- Reduce drift by enforcing standards through code, not documentation alone
- Improve CI workflows for infrastructure changes: plan and apply safety, policy checks, drift detection, and promotion across environments
- Remove manual steps from provisioning and onboarding by turning them into pipelines and documented runbooks
- Support internal module consumption patterns, including examples and reference implementations
- Favor repeatability and clarity over clever one-off solutions
- Operate platform-owned services with an ownership mindset; ownership is not optional
- Participate in the on-call rotation for platform services and follow incident procedures
- Write and maintain runbooks, dashboards, and alerts for what you ship
- Drive post-incident follow-ups that reduce repeat failures
- Implement least-privilege IAM patterns and secure-by-design defaults
- Partner with Security to integrate controls into pipelines and platform defaults
- Treat auditability as a feature: logs, approvals, traceability, and evidence
- Follow established governance and exception processes and document deviations
Requirements:
- 3+ years of experience in platform engineering, DevOps, SRE, or infrastructure engineering
- Working experience with AWS and infrastructure as code (Terraform preferred, CDKTF acceptable)
- Practical Kubernetes experience, preferably EKS (deploying, operating, debugging)
- Comfort with networking fundamentals: DNS, TLS, routing, load balancers, and security groups
- Ability to debug pipelines and distributed failures without guessing
- Strong written communication: design notes, runbooks, and crisp status updates