Role Overview

Responsible for enterprise production reliability, operational resilience, and disaster recovery governance within a regulated Financial Services environment.
Provides strategic and hands-on technical leadership across Incident Management, Problem Management, DevOps, and IT Service Continuity Management (ITSCM).
Defines reliability standards, leads high-priority incident response, eliminates systemic risk through structured root cause remediation, and governs the strategy and implementation of disaster recovery capabilities.
Partners closely with Engineering, Infrastructure, Development, Business, Risk, Compliance, and Internal Audit.
Serves as senior escalation authority for high-priority production incidents.
Drives blameless post-incident reviews and tracks remediation through formal governance processes.
Matures enterprise incident response frameworks and escalation models.
Owns the end-to-end Problem Management lifecycle.

Requirements

Minimum 10 years of relevant experience in Production Engineering, Disaster Recovery, DevOps, or Infrastructure Engineering.
Hands-on experience with the following tools: Monitoring: Datadog, New Relic, Elasticsearch, AWS CloudWatch; Incident Management: Jira Service Management, ITSM practices; CI/CD: TeamCity, Octopus Deploy, Bitbucket, GitHub, Azure DevOps; Infrastructure: AWS (EC2, S3, Lambda, ECS, IAM, CloudFormation or Terraform); Backup and Disaster Recovery (DR): Rubrik.
Strong programming/scripting ability in one or more of Python, Bash, PowerShell, or Go.
Experience building dashboards, KPIs, and reports for engineering and executive audiences.
Extensive knowledge of SRE frameworks, including SLOs, SLIs, MTTR, error budgets, and fault tolerance.
Extensive knowledge of Data Engineering principles, data lifecycle management, and data quality governance frameworks, ensuring reliability, accuracy, and integrity of enterprise data assets.
Strong interpersonal, verbal and written communication, and organizational skills.
Ability to manage multiple priorities simultaneously and deal with ambiguity.
Familiarity with compliance frameworks (ISO 22301, FFIEC, SOC 2, ISO 27001, ITIL, etc.) preferred.
Experience in WebOps environments, where website operations and infrastructure are managed alongside content management platforms with a strong focus on site speed, reliability, and user experience, preferred.
Experience with monitoring tools such as Datadog and Elasticsearch preferred.

Tech Stack

AWS
Azure
EC2
Elasticsearch
ITSM
Python
Terraform
Go

Benefits

health
dental
vision
life insurance
paid time-off benefits
flexible spending account
401(k) with employer match
ESPP

Principal Production Engineer
