Empower is focused on transforming financial lives through a flexible work environment and a commitment to inclusivity. The Lead Site Reliability Engineer will leverage technical expertise and leadership skills to enhance the reliability of Empower’s financial services platform, guiding SRE teams and establishing best practices across the organization.
Responsibilities:
- Lead cross-functional reliability initiatives across multiple value streams and coordinate execution across teams
- Define and evolve SRE best practices, tools, and methodologies across the organization
- Architect enterprise-scale, multi-region AWS infrastructure that balances reliability, cost, performance, and security
- Establish and operate SLOs, SLIs, and error budgets for critical services, using them to drive prioritization decisions
- Serve as incident commander for major incidents and drive postmortems that produce completed action items and organizational learning
- Lead disaster recovery planning for critical financial services infrastructure
- Build shared Infrastructure as Code foundations in Terraform (reusable modules, standards, and patterns adopted across teams)
- Design and implement production-scale Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling
- Establish observability standards and strategies using Datadog and Splunk (metrics, logging, tracing, dashboards, and alerting)
- Set CI/CD standards and patterns, including pipeline-as-code and progressive delivery at scale
- Lead chaos engineering, game days, and systematic reliability testing initiatives
- Drive FinOps initiatives to optimize cloud spend while maintaining reliability targets
- Lead a functional team of SREs (without direct reports) on projects and operational initiatives
- Mentor SREs at multiple levels through coaching, design reviews, code reviews, and training sessions
- Partner with Engineering, Product, and Security leadership to align reliability work with business priorities, zero-trust architecture, and compliance controls
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience)
- 7 to 10 years of Site Reliability Engineering experience (or equivalent), with demonstrated technical leadership
- Proven ability to lead technical teams and drive complex projects to completion
- Expert AWS knowledge, including designing large-scale, multi-region architectures
- Deep Kubernetes expertise, including advanced features, security, and production-scale operations
- Mastery of Infrastructure as Code using Terraform, including building shared platforms and frameworks
- Strong software engineering background with production experience in Python and/or Go
- Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale
- Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines
- Proven track record leading major incidents and conducting effective postmortems
- Strong understanding of security, networking, and infrastructure design patterns
- Strong communication skills, with the ability to explain complex technical concepts to diverse audiences
- Experience mentoring engineers and building technical capabilities in teams
- Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence
- Financial services industry experience, with an understanding of regulatory requirements
- Expertise in compliance frameworks (SOC 2, PCI DSS, FINRA)
- AWS certifications (Professional level)
- Kubernetes certifications (CKA, CKAD, CKS)
- Experience implementing SRE at organizations with 500+ engineers
- Background in chaos engineering, game days, and reliability testing practices
- Contributions to open-source projects with demonstrated community leadership
- Experience with service mesh implementation and management
- Track record of speaking at conferences or writing technical content