Intermediate Site Reliability Engineer, Tenant Scale: Tenant Services at GitLab

GitLab is an open-core software company that develops a comprehensive AI-powered DevSecOps Platform. The Site Reliability Engineer (SRE) will ensure the smooth operation of GitLab.com and other production systems by applying strong software engineering practices and managing scalable, reliable, and secure infrastructure.

Responsibilities:

Design and implement highly scalable infrastructure for GitLab.com to support current and future growth
Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction
Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the infrastructure department
Participate in a global on-call rotation during your local daytime hours, respond to production incidents, and contribute to clear, constructive incident reviews
Reduce toil by automating operational tasks and building tools that improve reliability, availability, and scalability
Apply infrastructure as code and configuration management practices to manage cloud resources and environments consistently
Write and maintain production-quality code, preferably in Go or Ruby, to enhance our systems and automation toolchain

Requirements:

Background working with the Kubernetes ecosystem, including tools such as Helm, and running production workloads
Experience operating cloud infrastructure on platforms like Google Cloud Platform or Amazon Web Services, especially networking, hosted Kubernetes services, and scaling
Hands-on practice with infrastructure as code and configuration management tools such as Ansible or Chef
Strong programming skills in a modern language, preferably Go or Ruby, applied to automation and reliability problems
Ability to clearly define problems, think beyond short-term fixes, and design solutions that improve systems over time
Consistent focus on reducing toil through automation and thoughtful system design
Independent, proactive working style with a bias for action and comfort operating as a 'manager of one' in a distributed, asynchronous environment
Clear written and verbal communication skills
Transferable experience from related reliability, infrastructure, or platform roles is welcome
