Branch, a leading provider of engagement and performance mobile SaaS solutions, is seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of its large-scale, distributed infrastructure. The role involves architecting complex systems, driving automation, and mentoring teams to improve operational efficiency and reliability.

Responsibilities:

Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale
Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs
Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene
Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies
Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security
Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance
Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org
Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency
Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles
Lead the GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments

Requirements:

6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments
Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems
Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals
Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling
Hands-on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty)
Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem
Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices
Proven incident management leadership in production SaaS systems, including on-call excellence, postmortem execution, and long-term reliability improvements
Exceptional problem-solving skills and the ability to lead complex investigations across multiple system layers
Strong communication, cross-functional leadership, and ability to influence engineering best practices
Hands-on expertise with Argo CD, GitOps workflows, and CI/CD architectures
