WEX, a leading company in the mobility sector, is seeking a Senior Director of Site Reliability Engineering to lead the engineering effort behind the resilience of its global mobility platform. The role calls for strategic leadership: defining the SRE roadmap, overseeing infrastructure for millions of concurrent trips, managing incidents, and optimizing cloud efficiency while fostering a productive engineering team.
Responsibilities:
- Define the multi-year SRE roadmap, pivoting from reactive firefighting to proactive, automated platform health
- Oversee infrastructure that supports millions of concurrent trips across diverse geographic regions, accounting for local regulatory and latency requirements
- Own the end-to-end incident lifecycle. You won't just manage the 'Big Outages'; you’ll foster a blameless culture focused on root-cause analysis (RCA) and permanent remediation
- Deploy AI/ML models that analyze historical telemetry data to predict capacity 'hotspots' and system fatigue several hours before they manifest
- Partner with Product and Engineering VPs to balance innovation speed with reliability via strictly enforced Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
- Work closely with product and commercial partners to prioritize work, working backwards from customer requirements to exceed expected outcomes
- Drive effective weekly, monthly, and quarterly mechanisms to plan, execute, and audit workstreams
- Optimize a massive global cloud footprint (AWS/GCP/Azure), ensuring performance doesn't come at the cost of unsustainable burn
- Champion 'Infrastructure as Code' (IaC) and self-service tooling so that developers can deploy safely without manual intervention
- Establish a robust engineering roadmap that keeps the engineering team clear on direction and motivated. Maintain career growth plans and provide monthly and quarterly feedback to support each individual's continued progress
- Establish metrics-driven measurement of developer productivity across the Mobility SRE organization
- Present to, influence, and communicate with the senior leadership team with confidence. Provide regular updates and insights to senior leadership on the challenges and opportunities within the Mobility domain. Manage up, across, and down effectively, backed by tangible written strategy documents and plans
Requirements:
- BS/MS in Computer Science, Engineering, or equivalent practical experience
- 12+ years in SRE, with at least 5 years in a senior leadership role (Director or above) managing managers
- Proven track record of managing distributed systems at hyperscale (e.g., millions of requests per second)
- Expertise in rapid development and deployment using cloud computing platforms such as AWS or Azure
- Deep understanding of Kubernetes, service mesh (Istio/Linkerd), edge computing, and global traffic management
- Excellent leadership, team-building, and dynamic decision-making skills
- Ability to deal with ambiguity and thrive in a fast-paced, dynamic environment
- Excellent verbal and written communication skills
- Experience with high-concurrency, geospatial, or real-time marketplace dynamics is a significant plus
- Experience building high-performance distributed systems at internet-scale companies
- Experience building credit card products, or developing solutions within a card scheme or network
- Experience building or managing fleet systems
- Experience working on closed-loop card systems