Recruits Lab is a well-funded, advanced AI research company focused on building next-generation foundation models. They are seeking a Senior Machine Learning Engineer to join their core model engineering team; the role is responsible for building, scaling, and optimizing large language models and for leading engineering efforts in distributed training and performance optimization.
Responsibilities:
- Lead end-to-end engineering of large language models (10B–100B+ parameters)
- Implement large-scale pre-training, supervised fine-tuning (SFT), and alignment pipelines
- Optimize model architectures and training strategies based on scaling laws and product objectives
- Drive measurable improvements in performance, reasoning capability, and training efficiency
- Architect and optimize multi-node GPU distributed training systems (A100/H100/B200 environments)
- Implement advanced parallelism strategies: data, tensor, pipeline, and sequence parallelism
- Maximize Model FLOPs Utilization (MFU) and overall cluster efficiency
- Improve training stability, fault tolerance, and monitoring
- Build and maintain TB- to PB-scale data pipelines
- Implement ingestion, cleaning, deduplication (MinHash/LSH), safety filtering, and PII removal
- Support multimodal data strategies, synthetic data generation, and curriculum learning
- Productionize alignment techniques (RLHF, DPO, KTO)
- Work with Mixture-of-Experts (MoE) architectures and routing optimization
- Improve model reasoning, math, and coding performance
- Build and enhance agent and tool-calling systems
- Uphold strong coding and system design standards
- Identify and eliminate performance bottlenecks
- Take ownership of major system components end-to-end
Requirements:
- MS/PhD in Computer Science, AI, or Mathematics, or equivalent practical experience
- Strong hands-on experience in engineering and optimizing large-scale deep learning systems
- Deep understanding of Transformer architectures (RoPE, FlashAttention, SwiGLU)
- Experience working with modern open-source or proprietary LLMs
- Advanced proficiency in PyTorch or JAX
- Experience with Megatron-LM, DeepSpeed, FSDP, or equivalent frameworks
- Strong understanding of 3D parallelism and ZeRO optimization strategies
- Hands-on experience training on large GPU clusters (100+ GPUs preferred)
- Familiarity with InfiniBand, RDMA, and storage I/O optimization
- Experience debugging large distributed training runs
- Highly self-driven and execution-focused
- Strong system ownership mindset
- Comfortable operating in fast-moving R&D environments
- Open-source contributions in the LLM ecosystem
- Experience building agentic systems or multi-step reasoning frameworks
- CUDA or Triton kernel optimization experience
- Published research or major production LLM deployments