Eltropy is on a mission to disrupt the way people access financial services and is seeking a Director of Site Reliability Engineering to lead and scale its SRE function. The role is responsible for the reliability, availability, performance, and efficiency of critical systems, and partners with teams across the organization to build resilient platforms that support business growth.

Responsibilities:

Define and execute the SRE vision, strategy, and roadmap aligned with business objectives
Build, mentor, and lead a high-performing team of SRE managers and engineers
Establish best practices for reliability, incident management, change management, and capacity planning
Serve as a senior technical leader and trusted advisor across the organization
Own system reliability metrics, including SLIs, SLOs, and error budgets
Lead major incident response, post-incident reviews, and long-term remediation efforts
Drive improvements in uptime, latency, scalability, and fault tolerance
Influence system architecture to improve resilience, scalability, and operability
Champion automation, Infrastructure as Code, and self-service platforms
Oversee observability strategy (monitoring, logging, tracing, alerting)
Ensure systems are designed for high availability, disaster recovery, and business continuity
Partner with Product, Platform, Security, and Compliance teams to meet operational and regulatory requirements
Define operational standards, runbooks, and on-call practices
Communicate reliability risks, tradeoffs, and performance to executive leadership

Requirements:

10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
5+ years in engineering leadership roles
Strong background in distributed systems, cloud platforms (AWS, GCP, Azure), and container orchestration (Kubernetes)
Hands-on experience with CI/CD, Infrastructure as Code (e.g., Terraform, CloudFormation), and automation
Proven experience defining and operating SLOs, SLIs, and error budgets
Excellent incident management and root cause analysis skills
Strong communication skills with the ability to influence technical and non-technical stakeholders
Experience supporting large-scale, high-traffic, or mission-critical systems
Background in software engineering or systems engineering
Experience scaling SRE practices in a fast-growing organization
Familiarity with security, compliance, and regulatory requirements
Bachelor's or Master's degree in Computer Science or a related field (or equivalent experience)
