Zeta Global is an AI-Powered Marketing Cloud that uses artificial intelligence to make marketing more efficient. The company is seeking a Principal DevOps Engineer to lead the transformation of software deployment and operations, ensuring safe and efficient CI/CD practices while managing platform reliability and compliance in a regulated environment.
Responsibilities:
- Design, build, and operate production-grade CI/CD pipelines that let developers across multiple teams deploy to production concurrently, multiple times daily, with zero-downtime guarantees
- Implement and optimize advanced deployment strategies including canary releases, blue/green deployments, rolling updates, incremental rollouts, and feature flag-gated releases via Statsig
- Build self-service deployment tooling that empowers developers to own their release process while enforcing safety guardrails, automated rollback triggers, and automated compliance gates
- Establish deployment observability with real-time canary analysis, automated health scoring, and progressive delivery metrics integrated with Grafana, Prometheus, and Honeycomb
- Champion CI/CD workflows using GitLab CI/CD, Helm charts, and Terraform to ensure infrastructure and application deployments are version-controlled, auditable, and reproducible
- Define and enforce SLOs/SLIs/SLAs across services, establishing error budgets that balance velocity with reliability
- Lead incident response processes, including on-call rotations, runbook development, blameless postmortems, and incident command structure
- Design and implement robust observability stacks leveraging Grafana, Prometheus, Loki, and Honeycomb for metrics, logging, tracing, and alerting at scale
- Proactively identify and eliminate reliability risks through chaos engineering, load testing, capacity planning, and failure mode analysis
- Reduce operational toil through automation, self-healing infrastructure patterns, and intelligent alerting to minimize mean time to detection (MTTD) and mean time to recovery (MTTR)
- Architect and manage Kubernetes clusters on AWS EKS at scale, including multi-cluster strategies, namespace governance, resource optimization, network policies, and security hardening
- Manage and optimize AWS infrastructure spanning EC2, SQS, DynamoDB, and related services with Infrastructure as Code (Terraform) best practices
- Design and operate Kafka-based event streaming infrastructure for high-throughput, low-latency data pipelines supporting real-time marketing and analytics workloads
- Ensure robust networking across the platform, including DNS management, service mesh configuration, load balancing, TCP/IP optimization, routing policies, and VPC architecture
- Manage containerization strategy using Docker, ensuring efficient image builds, vulnerability scanning, registry management, and runtime security
- Support data infrastructure operations across Snowflake, MySQL, and other database platforms, collaborating with data engineering teams on reliability and performance
- Embed compliance controls directly into CI/CD pipelines, ensuring automated enforcement of GDPR, CCPA, and SOC 2 requirements at every stage of the software delivery lifecycle
- Implement audit trails, change management controls, and deployment approval workflows required by regulatory frameworks in the MarTech and AdTech domains
- Collaborate with Security and Legal teams to ensure infrastructure and deployment processes meet global compliance obligations across all operating regions
- Maintain awareness of evolving privacy regulations (ePrivacy, state-level US laws, international data residency requirements) and proactively adapt infrastructure accordingly
- Serve as a technical leader and DevOps disruptor, challenging legacy processes and introducing modern practices that dramatically improve developer velocity and operational safety
- Influence software architecture decisions to simplify and streamline operational management, advocating for patterns that are deployment-friendly, observable, and resilient by design
- Clearly communicate complex technical strategies to engineering leadership, product stakeholders, and cross-functional teams to build alignment and drive adoption
- Develop reference architectures, internal standards, and golden path templates that codify best practices and accelerate onboarding of new services and teams
- Participate in on-call rotations and lead by example in incident response, demonstrating the operational discipline expected across the engineering organization
Requirements:
- 10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles, with demonstrated impact at staff or principal level
- Expert-level Kubernetes knowledge, including cluster administration, Helm chart authoring, custom controllers/operators, network policies, RBAC, and multi-cluster management on AWS EKS
- Deep expertise in CI/CD pipeline architecture and advanced deployment strategies (canary, blue/green, progressive delivery, feature flag integration) at scale
- Strong proficiency with Infrastructure as Code using Terraform, including module design, state management, and multi-environment orchestration
- Expert knowledge of Docker containerization, including multi-stage builds, security hardening, image optimization, and container runtime management
- Production experience with Apache Kafka, including cluster management, topic design, consumer group strategies, and operational monitoring for high-throughput streaming workloads
- Strong networking fundamentals: DNS (Route 53, internal DNS), TCP/IP, routing, API Gateway, load balancing (ALB/NLB), service mesh, VPC peering, transit gateways, and network troubleshooting
- Extensive AWS experience spanning EKS, EC2, SQS, DynamoDB, IAM, VPC, CloudWatch, and related services in production environments
- Hands-on experience with observability platforms: Grafana (dashboards, alerting), Prometheus (metrics, PromQL), Loki (log aggregation), and Honeycomb (distributed tracing, BubbleUp analysis)
- Working familiarity with multiple language stacks including Node.js, React, Python, Java, and Ruby, sufficient to understand build systems, dependency management, and runtime characteristics
- Experience operating within regulated environments, with practical knowledge of GDPR, CCPA, SOC 2, and compliance automation in MarTech or AdTech domains
- Proven ability to influence engineering culture, drive adoption of new practices, and communicate complex technical strategies clearly to both technical and non-technical stakeholders
- Demonstrated experience with GitLab CI/CD pipelines, including advanced pipeline features such as parent-child pipelines, dynamic environments, and security scanning integration
- AWS certifications: Solutions Architect Professional, DevOps Engineer Professional, or Security Specialty
- Experience with Statsig or similar feature flag and experimentation platforms for progressive delivery and A/B testing in production
- Background in chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey) for proactive resilience validation
- Experience building internal developer platforms (IDPs) or platform-as-a-product organizations
- Familiarity with FinOps practices and cloud cost optimization strategies
- Contributions to open-source DevOps/SRE tools or active participation in the broader infrastructure community
- Experience with service mesh technologies (Istio, Linkerd) for advanced traffic management and security