Origami Risk is a company that provides integrated SaaS solutions to organizations across the risk and insurance ecosystem. They are seeking a Site Reliability Engineer to improve site reliability, advance scalability, and lead post-incident investigations while collaborating with cross-functional teams to implement system enhancements.

Responsibilities:

Leads post-incident investigations for the Site Reliability team
Conducts in-depth post-incident analyses to identify root causes and develops preventive strategies
Drafts clear and insightful RCAs for customer delivery
Cross-trains colleagues on how to best leverage observability tools during incident and performance investigations
Provides visibility to all stakeholders throughout the entire Site Reliability process
Collaborates with cross-functional teams to implement system enhancements that improve scalability and stability
Develops client-focused dashboards/alerts to proactively identify performance challenges
Monitors and continuously improves our time-to-resolution metrics
Maintains and configures core observability tools to ensure optimal performance and that key metrics and data are available for incident response and performance investigations
Provides an actionable feedback loop to Observability and Engineering teams toward improving MELT (metrics, events, logs, traces) and development patterns
Contributes to the development of automation tools to streamline incident response
Works proactively to prevent incidents and reduce their impact on our platform
Partners with the larger Cloud Operations, SRE, and Engineering teams, and the business at large, to advance our SaaS platforms
Participates in on-call rotation with other team members as needed
Other duties as assigned

Requirements:

Bachelor's degree in Computer Science or related field (or equivalent experience)
5+ years of proven experience in a Site Reliability Engineering role
Strong knowledge of SRE best practices and incident management protocols
Deep experience using and/or configuring New Relic, Datadog, Sumo Logic, or similar observability tools
Proficiency in reading and writing code (e.g., JavaScript, .NET, SQL)
Familiarity with cloud platforms (e.g., AWS, Azure) and architectural patterns
Excellent problem-solving skills and a data-driven approach to incident analysis
Prior experience operating within a Public Cloud environment (AWS strongly preferred)
Experience troubleshooting C#/.NET-based web applications to identify bugs and performance challenges
Solid knowledge of SaaS operations
Ability to succeed when facing ambiguity and differing levels of operational maturity
Advanced written and verbal communication skills
Windows and SQL Server troubleshooting skills preferred
Knowledge of Continuous Integration and Continuous Delivery (CI/CD) pipelines preferred
Experience working in an Infrastructure as Code (IaC) environment preferred
Previous experience as a Software Engineer and/or System Administrator is a plus
