Braze is a leading customer engagement platform that empowers brands to deliver exceptional customer experiences. The Senior Site Reliability Engineer II will ensure the reliability and uptime of internal services and platforms, collaborating with engineering teams to improve infrastructure and automation.

Responsibilities:

Partner with Braze’s engineering teams on:
Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
Making monitoring and alerting fire on symptoms rather than on outages
Ensure that Braze meets our strict enterprise-grade SLAs with customers
Develop Braze’s internal platform infrastructure:
Create Infrastructure as code using Chef, Terraform, and Kubernetes
Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
Manage incidents:
Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
Use your on-call shift to prevent incidents from ever happening
Run retrospectives on everything that happens to turn lessons into system improvements, changes, automation, etc.

Requirements:

5+ years of experience as a Software, DevOps, or Site Reliability Engineer
You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
Have an urge to collaborate, document, and deliver quickly
Collaborate across global remote teams, often working asynchronously
Document everything so you don't need to learn the same thing (or plan the same work) twice
Deliver fast to delight our customers - even internal ones
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
Have a desire to solve everyday challenges facing software engineers and automate their toil away
Have an excellent ability to manage multiple tasks and expectations at once
Know your way around Linux and Unix Shell
Have strong programming skills - Ruby and/or Go preferred
Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
