Eltropy is on a mission to disrupt the way people access financial services. They are seeking a Senior Manager of Site Reliability Engineering to lead and scale their SRE function, ensuring the reliability and performance of critical systems while fostering a culture of automation and continuous improvement.
Responsibilities:
- Define and execute the SRE vision, strategy, and roadmap aligned with business objectives
- Build, mentor, and lead a high-performing team of SRE managers and engineers
- Establish best practices for reliability, incident management, change management, and capacity planning
- Serve as a senior technical leader and trusted advisor across the organization
- Own system reliability metrics, including SLIs, SLOs, and error budgets
- Lead major incident response, post-incident reviews, and long-term remediation efforts
- Drive improvements in uptime, latency, scalability, and fault tolerance
- Influence system architecture to improve resilience, scalability, and operability
- Champion automation, Infrastructure as Code, and self-service platforms
- Oversee observability strategy (monitoring, logging, tracing, alerting)
- Ensure systems are designed for high availability, disaster recovery, and business continuity
- Partner with Product, Platform, Security, and Compliance teams to meet operational and regulatory requirements
- Define operational standards, runbooks, and on-call practices
- Communicate reliability risks, tradeoffs, and performance to executive leadership
Requirements:
- 8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
- 3+ years in engineering leadership roles
- Strong background in distributed systems, cloud platforms (AWS, GCP, Azure), and container orchestration (Kubernetes)
- Hands-on experience with CI/CD, Infrastructure as Code (e.g., Terraform, CloudFormation), and automation
- Proven experience defining and operating SLOs, SLIs, and error budgets
- Excellent incident management and root cause analysis skills
- Strong communication skills with the ability to influence technical and non-technical stakeholders
- Experience supporting large-scale, high-traffic, or mission-critical systems
- Background in software engineering or systems engineering
- Experience scaling SRE practices in a fast-growing organization
- Familiarity with security, compliance, and regulatory requirements
- Bachelor's or Master's degree in Computer Science or a related field (or equivalent experience)