Lead infrastructure project teams across multiple domains (reliability, developer experience, cloud platform), providing technical direction, maintaining project plans, and keeping leadership and cross-functional stakeholders informed of progress, risks, and tradeoffs.
Partner with engineering leaders and peer Staff+ engineers across the org to align on infrastructure strategy, tie technical investments to business goals, and provide authoritative technical scoping for cross-functional initiatives.
Architect and deliver large, complex infrastructure systems, designing for scale, reliability, and operational simplicity. Drive decisions on build-vs-buy, technology selection, and migration strategy for the domains you lead.
Define and evolve Flex's infrastructure-as-code strategy, including Terraform module architecture, governance standards, and safe rollout patterns. Introduce new IaC tooling or frameworks when existing approaches no longer serve team needs, and drive adoption across engineering.
Lead strategic reliability improvements across services you work with, defining SLI/SLO frameworks with partner teams, delivering net-new ways to measure and communicate operational health and customer impact, and driving sustained reliability gains rather than one-off fixes.
Shape the developer platform strategy, identifying the highest-leverage investments in self-service tooling, CI/CD, and deployment automation. Set the quality bar for developer-facing infrastructure and ensure the team ships tooling that meaningfully accelerates engineering velocity.
Design cross-service observability architectures (metrics, logs, traces) with clear operational standards. Lead strategic alerting and runbook improvements that reduce mean-time-to-detect and mean-time-to-resolve across the org.
Drive systemic incident resilience: lead cross-team infrastructure incident response, identify recurring failure patterns, and own the follow-through that turns post-incident findings into durable infrastructure improvements. Proactively refocus team efforts when reliability projects are off course or not delivering meaningful risk reduction.
Build engineering rigor into team processes, improving design review standards, deployment checklists, operational readiness criteria, and code quality practices. Set a high bar and coach the team to consistently meet it.
Design AI-assisted workflows for your team: identify high-leverage opportunities where AI tooling can remove bottlenecks or enable previously infeasible work. Set guardrails for responsible AI use in infrastructure operations, evaluate emerging AI capabilities, and coach engineers on developing strong AI judgment.
Requirements
8+ years of hands-on infrastructure engineering experience in production environments, with at least 2 years operating at a senior or staff level, including leading technical projects, setting direction for other engineers, and making architecture-level decisions.
Deep experience architecting, operating, and scaling infrastructure on AWS, with demonstrated depth across several of: EKS, S3, RDS, API Gateway, VPC, Load Balancers, Lambda, DocumentDB, DynamoDB. GCP experience is a plus.
Track record of defining infrastructure-as-code strategy at scale, including Terraform module architecture, governance patterns, and driving adoption of IaC standards across teams.
Strong Kubernetes and container platform experience, including designing cluster architectures, managing multi-tenant workloads, and operating production microservice deployments at scale.
Proven ability to design and improve CI/CD systems (GitHub Actions preferred) with a focus on deployment safety, velocity, and developer experience. Evidence of introducing new tooling or processes that measurably improved deployment outcomes.
Experience designing observability architectures for distributed systems (metrics, logs, traces) and using observability data to drive reliability improvements. Datadog experience is a plus.
Solid networking knowledge (DNS, load balancing, firewalls, VPNs, service mesh, service-to-service connectivity) and experience applying it to solve cross-service infrastructure problems.
Strong technical communication and influence skills: ability to write clear technical strategy documents, present architecture decisions to leadership, explain complex tradeoffs across teams, and align stakeholders on technical direction.
Proficient in at least one of Java, Python, or TypeScript, with demonstrated code review practice and a track record of raising code quality standards through review feedback and tooling.
Demonstrated leadership mindset: leads project teams end-to-end, proactively identifies and redirects off-course work, builds engineering rigor into team processes, and takes ownership of outcomes beyond individual deliverables.
Tech Stack
AWS
Cloud
Distributed Systems
DNS
DynamoDB
Firewalls
Google Cloud Platform
Java
Kubernetes
Python
Terraform
TypeScript
Benefits
Competitive medical, dental, and vision
Company equity
401(k) plan with company match
Unlimited paid time off + 13 company-paid holidays
Parental leave
Flex Cares Program: Non-profit company match + pet adoption coverage