GitLab is an open-core software company that develops a comprehensive AI-powered DevSecOps Platform. The Site Reliability Engineer (SRE) will ensure the smooth operation of GitLab.com and other production systems by applying strong software engineering practices and managing scalable, reliable, and secure infrastructure.
Responsibilities:
- Design and implement highly scalable infrastructure for GitLab.com to support current and future growth
- Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction
- Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the Infrastructure organization
- Participate in a global on-call rotation during your local daytime hours, respond to production incidents, and contribute to clear, constructive incident reviews
- Reduce toil by automating operational tasks and building tools that improve reliability, availability, and scalability
- Apply infrastructure as code and configuration management practices to manage cloud resources and environments consistently
- Write and maintain production-quality code, preferably in Go or Ruby, to enhance our systems and automation toolchain
Requirements:
- Experience running production workloads in the Kubernetes ecosystem, including tools such as Helm
- Experience operating cloud infrastructure on platforms like Google Cloud Platform or Amazon Web Services, especially networking, hosted Kubernetes services, and scaling
- Hands-on practice with infrastructure as code and configuration management tools such as Ansible or Chef
- Strong programming skills in a modern language, preferably Go or Ruby, applied to automation and reliability problems
- Ability to clearly define problems, think beyond short-term fixes, and design solutions that improve systems over time
- Consistent focus on reducing toil through automation and thoughtful system design
- Independent, proactive working style with a bias for action and comfort operating as a 'manager of one' in a distributed, asynchronous environment
- Clear written and verbal communication skills
- Transferable experience from related reliability, infrastructure, or platform roles is welcome