Zeta Global is an AI-Powered Marketing Cloud that uses artificial intelligence to improve marketing efficiency. The company is seeking a Principal DevOps Engineer to lead the transformation of software deployment and operations, ensuring safe and efficient CI/CD practices while managing platform reliability and compliance in a regulated environment.

Responsibilities:

Design, build, and operate production-grade CI/CD pipelines enabling multiple developers on multiple teams to deploy concurrently to production, multiple times daily, with zero-downtime guarantees
Implement and optimize advanced deployment strategies including canary releases, blue/green deployments, rolling updates, incremental rollouts, and feature flag-gated releases via Statsig
Build self-service deployment tooling that empowers developers to own their release process while enforcing safety guardrails, automated rollback triggers, and automated compliance gates
Establish deployment observability with real-time canary analysis, automated health scoring, and progressive delivery metrics integrated with Grafana, Prometheus, and Honeycomb
Champion CI/CD workflows using GitLab CI/CD, Helm charts, and Terraform to ensure infrastructure and application deployments are version-controlled, auditable, and reproducible
Define and enforce SLOs/SLIs/SLAs across services, establishing error budgets that balance velocity with reliability
Lead incident response processes, including on-call rotations, runbook development, blameless postmortems, and incident command structure
Design and implement robust observability stacks leveraging Grafana, Prometheus, Loki, and Honeycomb for metrics, logging, tracing, and alerting at scale
Proactively identify and eliminate reliability risks through chaos engineering, load testing, capacity planning, and failure mode analysis
Reduce operational toil through automation, self-healing infrastructure patterns, and intelligent alerting to minimize mean time to detection (MTTD) and mean time to recovery (MTTR)
Architect and manage Kubernetes clusters on AWS EKS at scale, including multi-cluster strategies, namespace governance, resource optimization, network policies, and security hardening
Manage and optimize AWS infrastructure spanning EC2, SQS, DynamoDB, and related services with Infrastructure as Code (Terraform) best practices
Design and operate Kafka-based event streaming infrastructure for high-throughput, low-latency data pipelines supporting real-time marketing and analytics workloads
Ensure robust networking across the platform, including DNS management, service mesh configuration, load balancing, TCP/IP optimization, routing policies, and VPC architecture
Manage containerization strategy using Docker, ensuring efficient image builds, vulnerability scanning, registry management, and runtime security
Support data infrastructure operations across Snowflake, MySQL, and other database platforms, collaborating with data engineering teams on reliability and performance
Embed compliance controls directly into CI/CD pipelines, ensuring automated enforcement of GDPR, CCPA, and SOC 2 requirements at every stage of the software delivery lifecycle
Implement audit trails, change management controls, and deployment approval workflows required by regulatory frameworks in the MarTech and AdTech domains
Collaborate with Security and Legal teams to ensure infrastructure and deployment processes meet global compliance obligations across all operating regions
Maintain awareness of evolving privacy regulations (ePrivacy, state-level US laws, international data residency requirements) and proactively adapt infrastructure accordingly
Serve as a technical leader and DevOps disruptor, challenging legacy processes and introducing modern practices that dramatically improve developer velocity and operational safety
Influence software architecture decisions to simplify and streamline operational management, advocating for patterns that are deployment-friendly, observable, and resilient by design
Clearly communicate complex technical strategies to engineering leadership, product stakeholders, and cross-functional teams to build alignment and drive adoption
Develop reference architectures, internal standards, and golden path templates that codify best practices and accelerate onboarding of new services and teams
Participate in on-call rotations and lead by example in incident response, demonstrating the operational discipline expected across the engineering organization

Requirements:

10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles, with demonstrated impact at staff or principal level
Expert-level Kubernetes knowledge, including cluster administration, Helm chart authoring, custom controllers/operators, network policies, RBAC, and multi-cluster management on AWS EKS
Deep expertise in CI/CD pipeline architecture and advanced deployment strategies (canary, blue/green, progressive delivery, feature flag integration) at scale
Strong proficiency with Infrastructure as Code using Terraform, including module design, state management, and multi-environment orchestration
Expert knowledge of Docker containerization, including multi-stage builds, security hardening, image optimization, and container runtime management
Production experience with Apache Kafka, including cluster management, topic design, consumer group strategies, and operational monitoring for high-throughput streaming workloads
Strong networking fundamentals: DNS (Route 53, internal DNS), TCP/IP, routing, API Gateway, load balancing (ALB/NLB), service mesh, VPC peering, transit gateways, and network troubleshooting
Extensive AWS experience spanning EKS, EC2, SQS, DynamoDB, IAM, VPC, CloudWatch, and related services in production environments
Hands-on experience with observability platforms: Grafana (dashboards, alerting), Prometheus (metrics, PromQL), Loki (log aggregation), and Honeycomb (distributed tracing, BubbleUp analysis)
Working familiarity with multiple language stacks including Node.js, React, Python, Java, and Ruby, sufficient to understand build systems, dependency management, and runtime characteristics
Experience operating within regulated environments, with practical knowledge of GDPR, CCPA, SOC 2, and compliance automation in MarTech or AdTech domains
Proven ability to influence engineering culture, drive adoption of new practices, and communicate complex technical strategies clearly to both technical and non-technical stakeholders
Demonstrated experience with GitLab CI/CD pipelines, including advanced pipeline features such as parent-child pipelines, dynamic environments, and security scanning integration
AWS certifications: Solutions Architect Professional, DevOps Engineer Professional, or Security Specialty
Experience with Statsig or similar feature flag and experimentation platforms for progressive delivery and A/B testing in production
Background in chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey) for proactive resilience validation
Experience building internal developer platforms (IDPs) or platform-as-a-product organizations
Familiarity with FinOps practices and cloud cost optimization strategies
Contributions to open-source DevOps/SRE tools or active participation in the broader infrastructure community
Experience with service mesh technologies (Istio, Linkerd) for advanced traffic management and security
