Define, architect, and set standards for composable IaC (CDK or Terraform) patterns for Cloud Infrastructure (EKS).
Drive the adoption and implementation of composable, idempotent, multi-environment GitOps workflows.
Optimize scalability and cost-per-performance using metrics-driven automation and autoscaling technologies such as Karpenter.
Develop and maintain observability across the platform using Prometheus, Grafana, and distributed tracing.
Lead cross-functional efforts with application teams to define SLOs and capacity models for mission-critical services.
Lead major production incident response, drive blameless postmortems, and mentor other engineers on incident response, postmortems, and reliability reviews.
Create and operate resilient CI/CD pipelines for safe, rapid deployments and rollbacks.
Champion automation, low-toil operations, and a culture of continuous improvement.
Develop agentic workflows utilizing the team-wide context layer and operational data to accelerate development without compromising reliability.
Act as a technical leader and mentor to elevate the team, with a strong ability to listen, evaluate, and give constructive feedback on ideas.
Requirements
Demonstrated expertise in architecting, deploying, and maintaining high-throughput, low-latency distributed systems in cloud production environments, ideally in Platform, SRE, or DevOps roles. Prior ownership of a stateful deployment.
Deep systems-level coding expertise for automation and systems development (Go, Python, or TypeScript).
Proven experience operating Kubernetes at scale (EKS preferred) and applying IaC patterns (CDK, Terraform).
Working knowledge of GitOps and reconciliation loops in Kubernetes controllers.
Solid experience with CI/CD systems (GitHub Actions, AWS CodePipeline).
Expertise in defining, designing, and optimizing global monitoring and alerting pipelines (PromQL, metrics correlation, alert noise reduction).
Experience with large-scale streaming or ad-serving workloads, including HTTP-based delivery (oRTB, VAST), event streaming (Kafka), and AWS network architecture (VPC, load balancers, peering).
Understanding of cloud security best practices (IAM, encryption, network segmentation, zero trust).
Proven ability to conduct deep performance analysis, tuning, and optimization across the entire infrastructure stack to achieve optimal cost-per-performance and latency targets.
Tech Stack
AWS
Cloud
Distributed Systems
Grafana
Kafka
Kubernetes
Prometheus
Python
Terraform
TypeScript
Go
Benefits
Strong Medical, Dental and Vision Benefits, 100% paid by Wurl
Remote first policy
Flexible Time Off
10 US Holidays
401(k) Matching
Pre-Tax Savings Plans, HSA & FSA
Ginger, Aaptiv and Headspace subscriptions for mental and physical wellness
OneMedical subscription for 24/7 convenient medical care
Paid Maternity and Parental Leave for all family additions
Discounted PetPlan and easy at-home access to COVID testing with empowerDX