Penn Mutual is seeking a Senior Site Reliability Engineer (Senior SRE) to help evolve reliability practices across business-critical systems in a highly regulated financial services environment. The role calls for a hands-on technical leader responsible for designing, implementing, and advancing reliability across complex, distributed systems.

Responsibilities:

Lead reliability, availability, scalability, and recovery design for critical systems
Define and evolve SLOs, SLIs, and error budget practices across services
Identify systemic reliability risks and drive cross-team remediation efforts
Influence application and platform architecture to improve operational outcomes
Act as a technical lead during major incidents and complex outages
Drive high-quality root cause analysis and recommend corrective actions
Improve incident response processes, tooling, and runbooks
Design and implement advanced automation to eliminate operational toil at scale
Build and maintain shared SRE tooling and platforms
Set engineering standards for reliability-focused code and operational practices
Review and improve CI/CD, deployment, and rollback strategies
Partner with Release and Change Management to automate release practices
Lead risk assessments for high-impact changes and releases
Ensure compliance requirements are met without sacrificing engineering velocity
Serve as a reliability authority for release readiness decisions
Mentor junior SREs and junior engineers through technical guidance and review
Lead by example in operational excellence and engineering rigor
Influence reliability culture across engineering and product teams

Requirements:

Bachelor's degree in Computer Science, Engineering, or related field
6–10+ years of experience in SRE, software engineering, platform, or DevOps roles
Professional experience performing root cause analysis on incidents and documenting SRE systems and their usage
Strong programming skills with professional experience in multiple languages
Deep experience with AWS and distributed systems
Advanced knowledge of observability, ITSM, and reliability engineering principles
Proven ability to operate effectively in complex, regulated environments
Experience using and implementing observability tools (metrics, logs, tracing)
Experience with CI/CD pipelines and deployment automation
Experience with root cause analysis investigation and documentation
Familiarity with containerization and orchestration technologies
Strong troubleshooting and analytical skills
Experience with IaC (CloudFormation)
Experience with application frameworks (Spring, Spring Boot, React, Angular)
Experience with application servers/containers (Tomcat, Netty, Node.js, Next.js)
Experience with relational and non-relational databases and related ORM/drivers
Experience working in Agile/Scrum environments
Experience with ITSM tools (ServiceNow or similar)
Experience with ITIL-aligned change and release processes
Familiarity with security compliance frameworks (ISO 27001, SOC 2)
