Rocket.Chat, the world's largest open-source communications platform, is seeking a Senior Staff Site Reliability Engineer to lead the reliability and operational excellence of its infrastructure. The role involves overseeing cloud infrastructure and Kubernetes platforms, ensuring systems run efficiently across global deployments, and collaborating with teams across the company to support product growth and operational resilience.
Responsibilities:
- Influence core product architecture (Core, Fleetcommand, Omnichannel, etc.) before code is written to ensure reliability, scalability, and operability are baked in by design
- Lead the engineering of systemic solutions that eliminate entire classes of failures, moving the organization from reactive firefighting to proactive prevention
- Act as the technical visionary for our deployment products (LaunchControl, Airlock, Launchpad), defining the long-term technical roadmap and architectural standards alongside the Head of Infrastructure
- Design, prototype, and build foundational tooling, core libraries, and frameworks (in Go, Python, etc.) that make it easier for both SREs and Product Engineers to deploy safely, monitor accurately, and operate efficiently
- Champion and evolve Infrastructure as Code (IaC) practices (Pulumi, Terraform) so that they meet the needs of increasingly complex, multi-region, and air-gapped enterprise deployments
- Serve as the highest level of technical escalation for catastrophic, multi-domain Sev-1 incidents that standard operational protocols cannot resolve
- Drive the strategic direction of incident management, ensuring that post-mortems result in structural, org-wide improvements rather than localized band-aids
- Evolve the company's disaster recovery (DR) and chaos engineering programs to simulate and defend against complex cascading failures
- Define, document, and enforce global standards for observability (SLIs, SLOs, error budgets), alerting, and production readiness across all engineering squads
- Author foundational Architectural Decision Records (ADRs) and Requests for Discussion (RFDs) that guide the technical direction of the company
- Act as a role model and technical mentor for Senior and Mid-Level SREs, as well as Senior Product Engineers, elevating the overall technical culture of Rocket.Chat
- Facilitate org-wide technical enablement sessions, knowledge sharing, and blameless culture advocacy
- Partner with Engineering leadership, Product, Security, and Customer Success to align infrastructure strategy with business and customer needs
- Represent Rocket.Chat’s technical vision through technical writing, conference talks, and community engagement within the infrastructure and open-source ecosystem
- Foster a culture of ownership, operational excellence, and continuous improvement across the infrastructure organization
Requirements:
- Strong background in software engineering and infrastructure architecture with experience designing and operating large-scale distributed systems
- Expert understanding of microservices, event-driven architectures, stateful vs. stateless scaling constraints, and data consistency models
- Advanced coding proficiency (Go preferred, Python acceptable), with the ability to build complex core frameworks and contribute to the core Rocket.Chat codebase when necessary
- Deep expertise with Kubernetes and cloud infrastructure platforms (e.g., AWS, GCP, Azure, OVH) in production environments
- Extensive experience with Infrastructure as Code (IaC) tools such as Terraform, Pulumi, or Ansible
- Strong experience designing and managing CI/CD and GitOps deployment systems using tools like ArgoCD
- Hands-on experience with observability platforms including monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki)
- Strong understanding of networking fundamentals (TCP/IP, DNS, routing), security best practices, and cloud architecture principles
- Experience leading infrastructure, SRE, or platform engineering teams responsible for production systems
- Strong knowledge of containerized systems and deployment architectures supporting high availability and scalability
- Familiarity with database technologies such as MongoDB or Redis and their operational considerations at scale
- Experience supporting SaaS platforms with large-scale customer deployments
- Experience managing multi-cluster Kubernetes environments and multi-region architectures
- Ability to define and execute a long-term infrastructure vision aligned with company growth
- Experience with open-source software
- Active U.S. Security Clearance (or eligibility to obtain one) is a strong plus