Zeta Global is an AI-Powered Marketing Cloud that leverages advanced artificial intelligence and consumer signals to enhance marketing efficiency. The company is seeking a Senior Site Reliability Engineer to design, implement, and manage reliability metrics, develop production-grade software, and collaborate with engineering teams to ensure system reliability and scalability.
Responsibilities:
- Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives
- Develop production-grade software to enhance system reliability and reduce manual toil through automation
- Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights
- Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence
- Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress
- Design and participate in Chaos Engineering exercises
- Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green)
- Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management
- Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance
Requirements:
- Confident coding in Python or Golang, solving real-world problems through automation (not only scripting)
- Hands-on experience implementing SLIs, SLOs, and distributed tracing in production
- Working knowledge of Kubernetes, Terraform, and Infrastructure as Code tools
- Hands-on experience with Chaos Engineering and anomaly detection
- Enthusiasm for working with high-throughput, distributed systems processing millions of transactions daily
- 4+ years of experience as an SRE or in a similar role with hands-on coding
- 3+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code
- Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application
- Hands-on experience conducting postmortems and implementing observability at scale
- Hands-on experience conducting chaos engineering exercises
- Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb
- Experience with distributed tracing and handling high-cardinality metrics in production environments
- 3+ years of experience with AWS and proficiency in Kubernetes, Terraform, and Infrastructure as Code (IaC) tools
- Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes)
- Hands-on experience with CI/CD tools and practices (Jenkins, ArgoCD, GitOps) and building automated pipelines
- Familiarity with tools and frameworks for incident management and operational automation
- Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries)
- Strong skills in statistical analysis of metrics to identify and resolve performance bottlenecks