GitLab is the intelligent orchestration platform for DevSecOps, enabling organizations to increase developer productivity and improve operational efficiency. As a Senior Site Reliability Engineer, you will focus on automating the lifecycle of GitLab environments, ensuring they remain secure, consistent, and reliable at scale.
Responsibilities:
- Build & Scale Multi-Tenant Infrastructure: Design and implement automation that provisions and manages hundreds of isolated GitLab environments using Terraform, Ansible, and Kubernetes. Manage complex state strategies and workspace configurations to support scale and maintainability
- Debug & Resolve Production Issues: Troubleshoot issues across Kubernetes clusters, cloud services, and GitLab applications, identifying root causes of failed deployments, crash loops, and scheduling conflicts to ensure service continuity
- Automate Operations at Scale: Replace manual workflows with infrastructure-as-code solutions, including automated version upgrades, configuration rollouts, and provisioning pipelines that operate reliably across all tenants
- Monitor & Predict Capacity: Build observability systems that detect bottlenecks, predict usage trends, and optimize resource consumption using tools like Prometheus, ELK, and Grafana
- Respond & Lead During Incidents: Lead incident response and postmortem efforts, applying technical depth to resolve issues and establish operational standards that reduce future risk
- Architect & Collaborate: Influence architectural decisions around automation, scalability, and operational excellence. Partner with engineering teams to improve platform resilience and production readiness
Requirements:
- Proven ability to operate and troubleshoot production workloads across multiple tenants or environments
- Deep understanding of how distributed systems fail at scale and how to build in resilience
- Strong hands-on experience with Terraform, including workspace strategies, state management, and automation patterns that scale
- Skilled at diagnosing deployment failures, interpreting pod logs, and debugging scheduling issues and rollback scenarios in live environments
- Ability to read and debug code in Go and/or Ruby
- Experience supporting infrastructure for many customers or environments simultaneously
- Able to reason through complex systems and operational challenges
- Proven ability to work across teams and with internal or external customers to solve technical problems while maintaining service commitments and clear communication
- Comfortable using GitLab as a daily tool for infrastructure automation, collaboration, and operational workflows
- Experience with Ansible and templating tools like Jsonnet
- On-call experience, with the ability to lead technical discussions and incident resolution efforts under pressure