Empower is focused on transforming financial lives through a flexible work environment and a commitment to inclusivity. The Lead Site Reliability Engineer will leverage technical expertise and leadership skills to enhance the reliability of Empower’s financial services platform, guiding SRE teams and establishing best practices across the organization.

Responsibilities:

Lead cross-functional reliability initiatives across multiple value streams and coordinate execution across teams
Define and evolve SRE best practices, tools, and methodologies across the organization
Architect enterprise-scale, multi-region AWS infrastructure that balances reliability, cost, performance, and security
Establish and operate SLOs, SLIs, and error budgets for critical services, using them to drive prioritization decisions
Serve as incident commander for major incidents and drive postmortems that produce completed action items and organizational learning
Lead disaster recovery planning for critical financial services infrastructure
Build shared Infrastructure as Code foundations in Terraform (reusable modules, standards, and patterns adopted across teams)
Design and implement production-scale Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling
Establish observability standards and strategies using Datadog and Splunk (metrics, logging, tracing, dashboards, and alerting)
Set CI/CD standards and patterns, including pipeline-as-code and progressive delivery at scale
Lead chaos engineering, game days, and systematic reliability testing initiatives
Drive FinOps initiatives to optimize cloud spend while maintaining reliability targets
Lead a functional team of SREs (without direct reports) on projects and operational initiatives
Mentor SREs at multiple levels through coaching, design reviews, code reviews, and training sessions
Partner with Engineering, Product, and Security leadership to align reliability work with business priorities, zero-trust architecture, and compliance controls

Requirements:

Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent practical experience)
7 to 10 years of Site Reliability Engineering experience (or equivalent), with demonstrated technical leadership
Proven ability to lead technical teams and drive complex projects to completion
Expert AWS knowledge, including designing large-scale, multi-region architectures
Deep Kubernetes expertise, including advanced features, security, and production-scale operations
Mastery of Infrastructure as Code using Terraform, including building shared platforms and frameworks
Strong software engineering background with production experience in Python and/or Go
Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale
Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines
Proven track record leading major incidents and conducting effective postmortems
Strong understanding of security, networking, and infrastructure design patterns
Strong communication skills, with the ability to explain complex technical concepts to diverse audiences
Experience mentoring engineers and building technical capabilities in teams
Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence
Financial services industry experience, with an understanding of regulatory requirements
Expertise in compliance frameworks (SOC 2, PCI DSS, FINRA)
AWS certifications (Professional level)
Kubernetes certifications (CKA, CKAD, CKS)
Experience implementing SRE at organizations with 500+ engineers
Background in chaos engineering, game days, and reliability testing practices
Contributions to open-source projects with demonstrated community leadership
Experience with service mesh implementation and management
Track record of speaking at conferences or writing technical content
