Lead, mentor, and develop a team of L2/L3 Production Support Engineers.
Define, track, and optimise key operational metrics, including Mean Time to Resolution (MTTR) and Mean Time Between Failures (MTBF).
Lead the diagnosis and resolution of application-level issues using software development techniques and best practices.
Establish and refine incident management processes. Lead critical incident resolution and coordinate cross-functional response efforts.
Champion rigorous root cause analysis (RCA) practices across the team.
Identify opportunities to streamline support workflows, reduce manual effort through automation, and eliminate toil.
Serve as the primary technical contact for internal and external stakeholders.
Maintain oversight of production SaaS platforms, infrastructure stability, and system performance.
Requirements
At least 6 years in production support, DevOps, or Site Reliability Engineering roles, including at least 3 years leading or mentoring technical teams.
Proven experience troubleshooting application code issues using software development techniques: debuggers, profilers, log analysis, code review, and systematic problem-solving methodologies.
Demonstrated expertise building and scaling metrics-driven teams. Evidence of implementing or improving MTTR, MTBF, or similar KPIs with measurable results.
Strong background supporting SaaS/cloud-native production systems in high-availability, high-traffic environments.