Audiience™ is transforming how content is created and trusted in publishing. This role owns the end-to-end ML training infrastructure: architecting scalable, reproducible training pipelines and partnering with the research team to turn experimental work into production-ready systems.
Responsibilities:
- Architect and own the end-to-end ML training infrastructure - from data ingestion through experiment tracking to model checkpointing
- Build scalable, reproducible training pipelines that empower the research team to iterate fast without chaos
- Own compute orchestration, distributed training setups, and GPU cluster management
- Implement and manage experiment tracking (W&B, MLflow) and version-controlled data pipelines
- Collaborate directly with the research team to harden experimental approaches into production-ready pipelines
- Identify and systematically eliminate bottlenecks in training speed, cost, and reliability
Requirements:
- Deep experience with distributed training frameworks (FSDP, DeepSpeed, Megatron, or equivalent)
- Strong proficiency in PyTorch and modern ML tooling
- Experience with cloud compute orchestration (AWS, GCP, or Azure) at training scale
- Familiarity with containerization (Docker, Kubernetes) for ML workloads
- Solid understanding of ML experiment tracking, data versioning, and reproducibility
- Ability to profile and optimize training throughput and resource utilization
- Communication excellence – Can clearly articulate infrastructure decisions, tradeoffs, and failure post-mortems in writing
- Demonstrated ability to document systems architecture and explain technical decisions to a cross-functional team
- Prior experience as an ML infrastructure engineer, MLOps engineer, or applied researcher with strong infra instincts
- Problem-solving prowess – You see problems others don't and solve them in ways others can't
- Tenacious learner – Self-taught capabilities and continuous improvement are in your DNA
- Systems thinker – You understand how complex systems interact and create elegant solutions
- Results-oriented – Bias toward flexibility, impact, and getting it done
- Collaborative by nature – You believe we can only win if we do it together
Nice to have:
- Experience with CUDA, fused kernels, or low-level performance optimization
- Prior work with LLMs or large-scale foundation model training
- Experience building internal ML platforms or developer tooling for research teams
- Open-source contributions or published engineering writeups
- Previous startup or early-stage engineering experience
- Volunteer or community work