GitLab is the intelligent orchestration platform for DevSecOps, enabling organizations to enhance developer productivity and operational efficiency. As a Senior Engineer on the Runway team, you will lead the design and operation of the Kubernetes-based platform, drive strategic initiatives, and mentor engineers to improve developer experience and platform reliability.
Responsibilities:
- Lead the operation and evolution of production-grade Kubernetes clusters across cloud environments, making architectural decisions on upgrades, scaling, disaster recovery, and reliability improvements that impact the entire organization
- Define and drive GitOps strategy and standards across the organization: own Argo CD-based workflows, architect ApplicationSets, sync policies, and deployment standards, and mentor teams on GitOps best practices
- Architect Terraform-based infrastructure-as-code standards across teams, building reusable modules and practices that enable safe, scalable cloud infrastructure provisioning, with clear patterns for state management and drift detection
- Lead platform observability strategy and incident response processes, set standards for monitoring and post-incident reviews, and drive organization-wide improvements to availability, performance, and resilience
- Partner with and mentor application teams to onboard services onto the platform, establishing patterns for documentation, runbooks, and self-service tooling that scale across the organization and improve developer productivity
- Design and establish security control standards such as role-based access control (RBAC), network policies, and secrets management (for example, Vault, Sealed Secrets, or External Secrets Operator) that meet compliance requirements and scale across the organization
- Drive integration of platform capabilities with continuous integration pipelines (for example, GitHub Actions, GitLab CI, or Tekton) to establish end-to-end delivery workflows that set standards across the organization
Requirements:
- Experience operating and evolving production Kubernetes clusters (upgrades, scaling, disaster recovery, reliability) across one or more cloud environments (for example, Amazon EKS, Google GKE, or Azure AKS)
- Experience designing and running GitOps-based continuous delivery workflows with Argo CD, Flux, or similar tools; able to establish and maintain deployment standards across environments
- Experience with infrastructure as code (Terraform or equivalent), including reusable modules, state management, and drift detection practices for safe infrastructure provisioning
- Ability to write and maintain automation in a scripting or programming language (for example, Python, Bash, or Go) and guide others on best practices
- Working knowledge of networking fundamentals (DNS, load balancing, ingress) and related platform patterns (for example, service mesh) to design reliable network architectures
- Strong written and verbal communication skills, including mentoring, writing clear system documentation, and establishing runbooks and best practices across teams