GitLab is the intelligent orchestration platform for DevSecOps, and it is seeking a Senior Backend Engineer to help build the foundation for scaling GitLab.com through its Cells architecture. In this role, you will work on edge routing services and the Topology Service, ensuring reliable, low-latency routing across protocols while collaborating with other teams to enhance the platform.
Responsibilities:
- Design and implement edge traffic routing that directs requests to the correct Cell in a way that's transparent to users
- Build and evolve the Topology Service that serves as the authoritative source of cluster state for routing, resource assignment, and Cell lifecycle decisions
- Collaborate with feature teams across the product to make features and data models Cell-aware in the GitLab Rails monolith and supporting services
- Operate and improve the routing and topology systems you build by participating in tier-2 on-call, responding to escalated incidents, and strengthening observability and operational tooling
- Author Architecture Decision Records (ADRs), operational runbooks, and documentation so other teams can understand, adopt, and extend the Cells platform
- Review merge requests from GitLab team members and community contributors, maintaining high standards for correctness, performance, and security across the stack
Requirements:
- Experience building observable, resilient production services using Go or Ruby on Rails (TypeScript experience is a plus)
- Background delivering and operating production systems in high-scale environments, including incident response and operational ownership
- Ability to reason about distributed systems, including consistency models, partitioning strategies, failure modes, and operational tradeoffs
- Experience building high-throughput networking services (gRPC and protocol buffers knowledge is a plus)
- Familiarity with working in large, multi-team codebases and coordinating changes across teams and services, including making features and data models Cell-aware
- Knowledge of observability practices such as metrics, tracing, and alerting, with an approach focused on building systems you'd be confident operating on-call
- Strong written communication skills for an async-first, globally distributed team, including documenting decisions (for example, architecture decision records) and runbooks
- Experience working with relational databases in production, including schema design, migrations, and query performance tuning (PostgreSQL experience is a plus)