AWS, Azure, Cloud, Docker, Grafana, Jenkins, Kubernetes, Python, Terraform, Go, Bash, GitHub Actions, GitLab CI, Pulumi, Helm, ArgoCD, CloudFormation, Datadog, New Relic, GitHub, GitLab, Agile, Scrum, CI/CD, Leadership, Communication, Collaboration, Remote Work
About this role
Role Overview
Own uptime, availability, scalability, and performance of all production systems.
Define and manage SLOs, SLAs, error budgets, and incident response practices.
Lead post-incident reviews and drive systemic reliability improvements.