Responsible for enterprise production reliability, operational resilience, and disaster recovery governance within a regulated Financial Services environment.
Provides strategic and hands-on technical leadership across Incident Management, Problem Management, DevOps, and IT Service Continuity Management (ITSCM).
Defines reliability standards, leads high-priority incident response, eliminates systemic risk through structured root cause remediation, and governs the strategy and implementation of disaster recovery capabilities.
Partners closely with Engineering, Infrastructure, Development, Business, Risk, Compliance, and Internal Audit.
Serves as senior escalation authority for high-priority production incidents.
Drives blameless post-incident reviews and tracks remediation through formal governance processes.
Matures enterprise incident response frameworks and escalation models.
Owns the end-to-end Problem Management lifecycle.
Requirements
Minimum 10 years of relevant experience in Production Engineering, Disaster Recovery, DevOps, or Infrastructure Engineering.
Hands-on experience with the following tools: Monitoring: Datadog, New Relic, Elasticsearch, AWS CloudWatch; Incident Management: JIRA Service Management, ITSM practices; CI/CD: TeamCity, Octopus Deploy, Bitbucket, GitHub, Azure DevOps; Infrastructure: AWS (EC2, S3, Lambda, ECS, IAM) with CloudFormation or Terraform; Backup and Disaster Recovery (DR): Rubrik.
Strong programming or scripting ability in one or more of the following: Python, Bash, PowerShell, or Go.
Experience building dashboards, KPIs, and reports for engineering and executive audiences.
Extensive knowledge of SRE frameworks, including SLOs, SLIs, MTTR, error budgets, and fault tolerance.
Extensive knowledge of Data Engineering principles, data lifecycle management, and data quality governance frameworks, ensuring reliability, accuracy, and integrity of enterprise data assets.
Strong interpersonal, verbal and written communication, and organizational skills.
Ability to manage multiple priorities simultaneously and deal with ambiguity.
Familiarity with compliance frameworks (ISO 22301, FFIEC, SOC 2, ISO 27001, ITIL, etc.) preferred.
Experience in WebOps environments, where website operations and infrastructure are managed alongside content management platforms with a strong focus on site speed, reliability, and user experience, preferred.
Experience with monitoring tools such as Datadog and Elasticsearch preferred.