Rocket.Chat, the world's largest open-source communications platform, is seeking a Senior Staff Site Reliability Engineer to lead the reliability and operational excellence of its infrastructure. The role involves overseeing cloud infrastructure and Kubernetes platforms, keeping systems running efficiently across global deployments, and collaborating with teams across the company to support product growth and operational resilience.

Responsibilities:

Influence core product architecture (Core, Fleetcommand, Omnichannel, etc.) before code is written to ensure reliability, scalability, and operability are baked in by design
Lead the engineering of systemic solutions that eliminate entire classes of failures, moving the organization from reactive firefighting to proactive prevention
Act as the technical visionary for our deployment products (LaunchControl, Airlock, Launchpad), defining the long-term technical roadmap and architectural standards alongside the Head of Infrastructure
Design, prototype, and build foundational tooling, core libraries, and frameworks (in Go, Python, etc.) that make it easier for both SREs and Product Engineers to deploy safely, monitor accurately, and operate efficiently
Champion and evolve the Infrastructure as Code (IaC) paradigms (Pulumi, Terraform) to ensure they meet the needs of increasingly complex, multi-region, and air-gapped enterprise deployments
Serve as the highest level of technical escalation for catastrophic, multi-domain Sev-1 incidents that standard operational protocols cannot resolve
Drive the strategic direction of incident management, ensuring that post-mortems result in structural, org-wide improvements rather than localized band-aids
Evolve the company's disaster recovery (DR) and chaos engineering programs to simulate and defend against complex cascading failures
Define, document, and enforce global standards for observability (SLIs, SLOs, error budgets), alerting, and production readiness across all engineering squads
Author foundational Architectural Decision Records (ADRs) and Requests for Discussion (RFDs) that guide the technical direction of the company
Act as a role model and technical mentor for Senior and Mid-Level SREs, as well as Senior Product Engineers, elevating the overall technical culture of Rocket.Chat
Facilitate org-wide technical enablement sessions, knowledge sharing, and blameless culture advocacy
Partner with Engineering leadership, Product, Security, and Customer Success to align infrastructure strategy with business and customer needs
Represent Rocket.Chat’s technical vision through writing, conference talks, and community engagement within the infrastructure and open-source ecosystem
Foster a culture of ownership, operational excellence, and continuous improvement across the infrastructure organization

Requirements:

Strong background in software engineering and infrastructure architecture with experience designing and operating large-scale distributed systems
Expert understanding of microservices, event-driven architectures, stateful vs. stateless scaling constraints, and data consistency models
Advanced coding proficiency (Go preferred, Python acceptable), with the ability to build complex core frameworks and contribute to the core Rocket.Chat codebase when necessary
Deep expertise with Kubernetes and cloud infrastructure platforms (e.g., AWS, GCP, Azure, OVH) in production environments
Extensive experience with Infrastructure as Code (IaC) tools such as Terraform, Pulumi, or Ansible
Strong experience designing and managing CI/CD and GitOps deployment systems using tools like ArgoCD
Hands-on experience with observability platforms including monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki)
Strong understanding of networking fundamentals (TCP/IP, DNS, routing), security best practices, and cloud architecture principles
Experience leading infrastructure, SRE, or platform engineering teams responsible for production systems
Strong knowledge of containerized systems and deployment architectures supporting high availability and scalability
Familiarity with database technologies such as MongoDB or Redis and their operational considerations at scale
Experience supporting SaaS platforms with large-scale customer deployments
Experience managing multi-cluster Kubernetes environments and multi-region architectures
Ability to define and execute a long-term infrastructure vision aligned with company growth
Experience with open-source software
Active U.S. Security Clearance (or eligibility to obtain one) is a strong plus
