Role Overview

Partner with application and platform teams to embed reliability into system design, development, and operations
Support implementation and operationalization of Service Level Objectives and reliability indicators
Contribute to improving observability coverage across logs, metrics, traces, and events
Apply reliability patterns such as fault isolation, failover, and recovery mechanisms in collaboration with engineering teams
Participate in and support improvements to the incident lifecycle, including detection, response, root cause analysis, and follow-up actions
Assist in identifying reliability risks and performance bottlenecks and contribute to remediation efforts
Support continuous improvement initiatives focused on reducing incident volume and improving system stability
Apply established enterprise standards for observability, resilience engineering, and Service Level Objectives
Support adoption of reliability practices across teams through hands-on guidance and collaboration
Contribute feedback to help evolve reliability frameworks and tooling
Develop and enhance automation for incident response, monitoring, and operational workflows
Leverage existing platforms (e.g., observability tools, incident management systems) to improve efficiency and visibility
Utilize AI-enabled capabilities where appropriate to support diagnostics and operational workflows under defined governance
Work closely with product, platform, and ITSM teams to align on reliability improvements
Participate in cross-team initiatives focused on improving system resilience and operational maturity
Contribute to knowledge sharing within the reliability engineering community

Requirements

Experience in one or more of the following: system integration, software development, system administration, or operations engineering
Familiarity with the software development life cycle (SDLC) and production support models
Understanding of monitoring, observability, and performance optimization concepts
Experience supporting applications in cloud and/or on-premises environments
Working knowledge of CI/CD pipelines and deployment practices
Basic understanding of incident management and root cause analysis processes
Knowledge of system reliability principles, including availability and performance engineering
Strong problem-solving skills with a focus on continuous improvement
Ability to collaborate effectively across engineering and operations teams

Tech Stack

Cloud
ITSM
SDLC

Benefits

Flexible Work Arrangements

Staff Reliability Engineer
