Origami Risk provides integrated SaaS solutions to organizations across the risk and insurance ecosystem. The company is seeking a Site Reliability Engineer to improve site reliability, advance scalability, and lead post-incident investigations while collaborating with cross-functional teams to implement system enhancements.
Responsibilities:
- Leads post-incident investigations for the Site Reliability team
- Conducts in-depth post-incident analyses to identify root causes and develops preventive strategies
- Drafts clear and insightful RCAs for customer delivery
- Cross-trains colleagues on how to best leverage observability tools during incident and performance investigations
- Provides visibility to all stakeholders throughout the entire Site Reliability process
- Collaborates with cross-functional teams to implement system enhancements that improve scalability and stability
- Develops client-focused dashboards/alerts to proactively identify performance challenges
- Monitors and continuously improves our time-to-resolution metrics
- Maintains and configures core observability tools to ensure optimal performance and that key metrics and data are available for incident response and performance investigations
- Provides an actionable feedback loop to Observability and Engineering teams toward improving MELT (metrics, events, logs, and traces) and development patterns
- Contributes to the development of automation tools to streamline incident response
- Works proactively to prevent incidents and reduce their impact on our platform
- Partners with the larger Cloud Operations, SRE, and Engineering teams, and the business at large, to advance our SaaS platforms
- Participates in on-call rotation with other team members as needed
- Other duties as assigned
Requirements:
- Bachelor's degree in Computer Science or related field (or equivalent experience)
- 5+ years of proven experience in a Site Reliability Engineering role
- Strong knowledge of SRE best practices and incident management protocols
- Deep experience using and/or configuring New Relic, Datadog, Sumo Logic, or similar observability tools
- Proficiency in reading and writing code (e.g., JavaScript, .NET, SQL)
- Familiarity with cloud platforms (e.g., AWS, Azure) and architectural patterns
- Excellent problem-solving skills and a data-driven approach to incident analysis
- Prior experience operating within a Public Cloud environment (AWS strongly preferred)
- Experience troubleshooting C#/.NET-based web applications to identify bugs and performance challenges
- Solid knowledge of SaaS operations
- Ability to succeed when facing ambiguity and differing levels of operational maturity
- Advanced written and verbal communication skills
- Windows and SQL Server troubleshooting skills preferred
- Knowledge of Continuous Integration and Continuous Delivery (CI/CD) pipelines preferred
- Experience working in an Infrastructure as Code (IaC) environment preferred
- Previous experience as a Software Engineer and/or System Administrator is a plus