Zeta Global is an AI-Powered Marketing Cloud that uses artificial intelligence to improve customer acquisition and retention for marketers. The Lead AI Engineer will own the post-training lifecycle of large language models, with a focus on Supervised Fine-Tuning (SFT) and optimization, and will work with data engineering, platform, and product teams to integrate these models into production systems.
Responsibilities:
- Lead Supervised Fine-Tuning (SFT) of large language models in production, shaping instruction-following, reasoning quality, tone, and domain-specific behavior
- Extend SFT pipelines with instruction tuning and preference-based optimization (e.g., RLHF-style approaches or direct preference optimization)
- Design, curate, and maintain high-quality SFT and preference datasets, combining human-labeled and synthetic data tailored to real-world marketing and decisioning use cases
- Own model evaluation and benchmarking, including:
  - Offline behavioral evals (instruction adherence, reasoning depth, hallucination rates)
  - Online experiments and A/B tests
  - Continuous regression detection and performance monitoring
- Develop and operate agentic LLM systems, enabling multi-step reasoning, tool use, workflow orchestration, and decision execution
- Implement and optimize prompting, retrieval-augmented generation (RAG), memory, and tool-calling strategies, with a clear understanding of when to solve problems via SFT versus prompting
- Partner closely with data engineering, platform, and product teams to integrate fine-tuned models into high-throughput, low-latency systems
- Establish best practices for LLM versioning, experimentation, deployment, rollback, governance, and safety
- Provide technical leadership and mentorship to engineers working on applied AI and LLM systems
Requirements:
- Significant hands-on experience with Supervised Fine-Tuning (SFT) of LLMs in production, beyond prompt-only approaches
- Direct experience using OpenAI APIs and/or AWS Bedrock for SFT, post-training, and deployment
- Strong understanding of LLM post-training workflows, including data preparation, instruction tuning, evaluation methodologies, and common failure modes
- Experience building and operating agentic LLM systems (tool use, multi-step reasoning, workflow orchestration)
- Proficiency in Python and modern ML frameworks (e.g., PyTorch)
- Experience operating ML systems in distributed production environments
- Strong intuition for trade-offs between model quality, latency, cost, safety, and scalability