Finance of America, a financial services company, is seeking a Principal Production Engineer to ensure enterprise production reliability and operational resilience. The role involves leading incident management, problem management, and disaster recovery efforts while collaborating with various teams to maintain compliance and protect customer trust.

Responsibilities:

Serves as senior escalation authority for high-priority production incidents
Leads coordinated response efforts to restore services within defined Service Level Objectives (SLOs)
Ensures documented impact assessments for financially significant systems
Drives blameless post-incident reviews and tracks remediation through formal governance processes
Matures enterprise incident response frameworks and escalation models
Partners with Change Enablement, Risk, and Engineering teams to reduce production risk and improve service stability
Owns the end-to-end Problem Management lifecycle, including root cause analysis, known error documentation, and permanent corrective actions
Identifies systemic control weaknesses and drives remediation to prevent repeat incidents
Establishes structured reporting on recurring incidents, MTTR trends, and control effectiveness
Defines and governs enterprise Disaster Recovery (DR) strategy and plans
Ensures alignment of Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) with Business Impact Analysis (BIA)
Leads annual and periodic DR testing exercises, validation, and evidence documentation, and reports DR readiness and resilience metrics to senior leadership
Coordinates with Business Continuity, Third-Party Risk, and Infrastructure teams to strengthen operational resilience
Establishes enterprise production reliability standards and resilience frameworks
Architects and enhances observability across applications and infrastructure (metrics, logs, traces)
Defines and monitors SLIs/SLOs, availability targets, and error budgets
Drives automation to reduce operational toil and improve system scalability
Partners in implementation and optimization of monitoring platforms (e.g., Datadog, New Relic, Elasticsearch, AWS native tools)
Integrates monitoring and alerting workflows with ITSM platforms (e.g., Jira Service Management) for automated ticketing and escalation
Ensures Incident, Problem, Change, and DR processes support SOX ITGC design and operating effectiveness
Maintains audit-ready documentation and evidence for regulatory and internal audit reviews
Participates in control walkthroughs and audit engagements
Identifies and remediates production control gaps impacting financial systems
Establishes resilience metrics aligned to enterprise risk appetite and regulatory expectations
Responds promptly and effectively to urgent business matters as they arise
Performs other duties as assigned

Requirements:

Minimum 10 years of relevant experience in Production Engineering, Disaster Recovery, DevOps, or Infrastructure Engineering
Hands-on experience with the following tools and technologies, or comparable platforms: Observability: Datadog, New Relic, Elasticsearch, AWS CloudWatch; Incident Management: Jira Service Management, ITSM practices; CI/CD: TeamCity, Octopus Deploy, Bitbucket, GitHub, Azure DevOps; Infrastructure: AWS (EC2, S3, Lambda, ECS, IAM, CloudFormation or Terraform); Backup and Disaster Recovery (DR): Rubrik
Strong programming/scripting ability in one or more: Python, Bash, PowerShell, Go
Experience building dashboards, KPIs, and reports for engineering and executive audiences
Extensive knowledge of SRE frameworks, including SLOs, SLIs, MTTR, error budgets, and fault tolerance
Extensive knowledge of Data Engineering principles, data lifecycle management, and data quality governance frameworks, ensuring reliability, accuracy, and integrity of enterprise data assets
Strong interpersonal, verbal and written communication, and organizational skills
Ability to manage multiple priorities simultaneously and deal with ambiguity
Bachelor's Degree in Computer Science or related technical field
Vendor or industry standard certifications in applicable specialty or related technology areas
Familiarity with one or more compliance frameworks (ISO 22301, FFIEC, SOC 2, ISO 27001, ITIL, etc.) preferred
Experience in WebOps environments, where website operations and infrastructure are managed alongside content management platforms (e.g., Pantheon or WordPress) with a strong focus on site speed, reliability, and user experience, preferred
Experience with monitoring tools such as Datadog and Elasticsearch preferred
Master's Degree
