Penn Mutual is seeking a Senior Site Reliability Engineer (Senior SRE) to help evolve reliability practices across business-critical systems in a highly regulated financial services environment. This is a hands-on technical leadership role responsible for designing, implementing, and advancing reliability across complex, distributed systems.
Responsibilities:
- Lead reliability, availability, scalability, and recovery design for critical systems
- Define and evolve SLOs, SLIs, and error budget practices across services
- Identify systemic reliability risks and drive cross-team remediation efforts
- Influence application and platform architecture to improve operational outcomes
- Act as a technical lead during major incidents and complex outages
- Drive high-quality root cause analysis and recommend corrective actions
- Improve incident response processes, tooling, and runbooks
- Design and implement advanced automation to eliminate operational toil at scale
- Build and maintain shared SRE tooling and platforms
- Set engineering standards for reliability-focused code and operational practices
- Review and improve CI/CD, deployment, and rollback strategies
- Partner with Release and Change Management to automate release practices
- Lead risk assessments for high-impact changes and releases
- Ensure compliance requirements are met without sacrificing engineering velocity
- Serve as a reliability authority for release readiness decisions
- Mentor junior SREs and junior engineers through technical guidance and review
- Lead by example in operational excellence and engineering rigor
- Influence reliability culture across engineering and product teams
Requirements:
- Bachelor's degree in Computer Science, Engineering, or related field
- 6–10+ years of experience in SRE, software engineering, platform, or DevOps roles
- Professional experience performing root cause analysis on incidents and documenting SRE systems and their usage
- Strong programming skills with professional experience in multiple languages
- Deep experience with AWS and distributed systems
- Advanced knowledge of observability, ITSM, and reliability engineering principles
- Proven ability to operate effectively in complex, regulated environments
- Experience using and implementing observability tools (metrics, logs, tracing)
- Experience with CI/CD pipelines and deployment automation
- Experience investigating and documenting root cause analyses
- Familiarity with containerization and orchestration technologies
- Strong troubleshooting and analytical skills
- Experience with IaC (CloudFormation)
- Experience with application frameworks (Spring, Spring Boot, React, Angular)
- Experience with application servers/containers (Tomcat, Netty, Node.js, Next.js)
- Experience with relational and non-relational databases and related ORM/drivers
- Experience working in Agile/Scrum environments
- Experience with ITSM tools (ServiceNow or similar)
- Experience with ITIL-aligned change and release processes
- Familiarity with security compliance frameworks (ISO 27001, SOC 2)