Audiience™ is transforming how content is created and trusted in publishing. This role owns the end-to-end ML training infrastructure: architecting scalable, reproducible training pipelines and partnering with the research team to turn experimental work into production-ready systems.
Responsibilities:
- Architect and own the end-to-end ML training infrastructure - from data ingestion through experiment tracking to model checkpointing
- Build scalable, reproducible training pipelines that empower the research team to iterate fast without chaos
- Own compute orchestration, distributed training setups, and GPU cluster management
- Implement and manage experiment tracking (W&B, MLflow) and version-controlled data pipelines
- Collaborate directly with the research team to harden experimental approaches into production-ready pipelines
- Identify and systematically eliminate bottlenecks in training speed, cost, and reliability
Requirements:
- Deep experience with distributed training frameworks (FSDP, DeepSpeed, Megatron, or equivalent)
- Strong proficiency in PyTorch and modern ML tooling
- Experience with cloud compute orchestration (AWS, GCP, or Azure) at training scale
- Familiarity with containerization (Docker, Kubernetes) for ML workloads
- Solid understanding of ML experiment tracking, data versioning, and reproducibility
- Ability to profile and optimize training throughput and resource utilization
- Communication excellence – Can clearly articulate infrastructure decisions, tradeoffs, and failure post-mortems in writing
- Demonstrated ability to document systems architecture and explain technical decisions to a cross-functional team
- Prior experience as an ML infrastructure engineer, MLOps engineer, or applied researcher with strong infra instincts
- Problem-solving prowess – You see problems others don't and solve them in ways others can't
- Tenacious learner – Self-taught capabilities and continuous improvement are in your DNA
- Systems thinker – You understand how complex systems interact and create elegant solutions
- Results-oriented – Bias toward flexibility, impact, and getting it done
- Collaborative by nature – You believe we can only win if we do it together
Nice to have:
- Experience with CUDA, fused kernels, or low-level performance optimization
- Prior work with LLMs or large-scale foundation model training
- Experience building internal ML platforms or developer tooling for research teams
- Open-source contributions or published engineering writeups
- Previous startup or early-stage engineering experience
- Volunteer or community work