Branch, a leading provider of engagement and performance mobile SaaS solutions, is seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of its large-scale, distributed infrastructure. The role involves architecting complex systems, driving automation, and mentoring teams to improve operational efficiency and reliability.
Responsibilities:
- Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale
- Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs
- Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene
- Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies
- Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security
- Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance
- Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org
- Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency
- Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles
- Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments
Requirements:
- 6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments
- Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems
- Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals
- Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling
- Hands-on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty)
- Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem
- Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices
- Proven incident management leadership in production SaaS systems, including on-call excellence, postmortem execution, and long-term reliability improvements
- Exceptional problem-solving skills and the ability to lead complex investigations across multiple system layers
- Strong communication, cross-functional leadership, and ability to influence engineering best practices
- Hands-on expertise with Argo CD, GitOps workflows, and CI/CD architectures