FirstPrinciples is a non-profit organization dedicated to building an autonomous AI Physicist to advance our understanding of fundamental physics. The Member of Technical Staff, Training Engineer will develop and lead the pre-training of large language models, making critical modeling choices and ensuring that data pipelines and distributed training processes run effectively.
Responsibilities:
- Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs
- Tune optimizer configurations (AdamW/Adafactor/Sophia variants), learning-rate schedules with warmup, dropout, gradient clipping, weight decay, EMA, and activation checkpointing to keep long runs stable and memory-efficient at scale (see the LR-schedule sketch after this list)
- Own model and training recipes end-to-end, making informed decisions about microbatch and global batch configurations (see the gradient-accumulation sketch below)
- Run ablations and scaling-law studies to set tokens-to-train targets, entropy/perplexity goals, and checkpoint cadence that optimize the cost-to-quality trade-off (see the compute-budget sketch below)
- Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control (see the MinHash dedup sketch below)
- Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling (e.g., D4-style approaches)
- Architect comprehensive data pipelines across diverse modalities (web/book/code/speech/vision) with filtering, heuristic and learned scoring, temperature sampling, multilingual balancing, and curriculum learning (see the temperature-sampling sketch below)
- Demonstrate measurable impact from data quality work including large-scale deduplication, contamination audits, and repeat/mixture scheduling that improves downstream accuracy
- Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and NCCL collectives over high-speed interconnects (NVLink/InfiniBand)
- Choose and configure distributed strategies (FSDP vs. ZeRO; 3D/5D hybrid parallelism for MoE) and launch parameters, documenting trade-offs for future reference (see the FSDP sketch below)
- Exploit modern kernels and mixed-precision training (FlashAttention-3, FP8 via NVIDIA Transformer Engine) to maximize tokens/sec while maintaining perplexity targets (see the attention sketch below)
- Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels without compromising convergence
- Debug complex distributed training issues including deadlocks, OOMs, divergence, and stragglers using tools like Nsight, py-spy, TensorBoard, and W&B
- Build comprehensive observability for long-horizon runs, tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, and data freshness, and feeding evaluation dashboards (see the telemetry sketch below)
- Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs and clock-skew issues, and implement elastic restart mechanisms (see the checkpoint/restart sketch below)
- Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins
- Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs
- Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations, ensuring pre-training choices support downstream alignment objectives
- Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services
- Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques
- Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge
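Illustrative sketches:

A minimal sketch of the warmup-plus-cosine learning-rate schedule referenced in the optimizer-tuning bullet, assuming PyTorch; the peak LR, warmup length, step count, and model are illustrative placeholders, not production values:

```python
import math

import torch


def lr_lambda(step: int, warmup_steps: int = 2000,
              total_steps: int = 100_000, min_ratio: float = 0.1) -> float:
    """Linear warmup to the peak LR, then cosine decay to min_ratio * peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = torch.nn.Linear(1024, 1024)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.
```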
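For the microbatch/global-batch bullet: gradient accumulation is the standard way to reach a large global batch with a small per-device microbatch. A single-rank sketch with synthetic data; all sizes are illustrative:

```python
import torch

GLOBAL_BATCH = 1024  # sequences per optimizer step across the whole job
MICRO_BATCH = 8      # sequences that fit on one device per forward pass
WORLD_SIZE = 16      # data-parallel ranks (enters only the arithmetic here)
ACCUM_STEPS = GLOBAL_BATCH // (MICRO_BATCH * WORLD_SIZE)

model = torch.nn.Linear(512, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = ((torch.randn(MICRO_BATCH, 512), torch.randn(MICRO_BATCH, 1))
          for _ in range(ACCUM_STEPS * 10))  # stand-in data stream

for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y) / ACCUM_STEPS
    loss.backward()  # gradients accumulate across microbatches
    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```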
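For tokens-to-train targets, a common starting point before fitting in-house scaling laws is the Chinchilla heuristic of roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6ND (Hoffmann et al., 2022). A back-of-envelope sketch with illustrative model sizes:

```python
def chinchilla_budget(n_params: float) -> tuple:
    """Return (tokens, FLOPs) under D ≈ 20 N and C ≈ 6 N D."""
    tokens = 20.0 * n_params
    flops = 6.0 * n_params * tokens
    return tokens, flops


for n in (7e9, 70e9):  # illustrative model sizes
    tokens, flops = chinchilla_budget(n)
    print(f"{n / 1e9:.0f}B params -> ~{tokens / 1e9:,.0f}B tokens, ~{flops:.2e} FLOPs")
```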
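For deduplication, a toy sketch of MinHash over word shingles, the family of techniques behind large-scale near-duplicate detection. This is pure Python for clarity; production pipelines add LSH banding and run over sharded corpora:

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set: set, num_perm: int = 64) -> list:
    # Simulate num_perm independent hash functions via a per-function seed prefix.
    return [
        min(int.from_bytes(hashlib.blake2b(seed.to_bytes(4, "big") + s.encode(),
                                           digest_size=8).digest(), "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]


def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the training corpus is filtered deduplicated and packed by length before being streamed to the cluster each epoch"
doc_b = doc_a + " nightly"  # near-duplicate differing by one trailing word
sim = jaccard_estimate(minhash_signature(shingles(doc_a)),
                       minhash_signature(shingles(doc_b)))
print(sim)  # high estimate; pairs above a threshold (often ~0.8) are collapsed
```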
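For multilingual balancing, temperature (exponent) sampling flattens size-proportional source probabilities so low-resource languages are seen more often, following the multilingual pre-training literature; conventions differ on whether the exponent is written as α or 1/T. A sketch with made-up corpus sizes:

```python
def temperature_weights(token_counts: dict, temperature: float = 0.7) -> dict:
    """Sampling probabilities p_i proportional to (n_i / N) ** temperature.

    temperature = 1.0 reproduces size-proportional sampling; lower values
    flatten the mixture toward low-resource sources.
    """
    total = sum(token_counts.values())
    raw = {k: (v / total) ** temperature for k, v in token_counts.items()}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}


# Illustrative corpus sizes in tokens, not real statistics.
mix = temperature_weights({"en": 2e12, "de": 2e11, "sw": 5e9}, temperature=0.7)
print(mix)  # "sw" rises from ~0.2% of raw tokens to ~1.2% of samples
```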
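A minimal sketch of wrapping a model in PyTorch FSDP with full (ZeRO-3-style) sharding and bf16 mixed precision; the network and its sizes are stand-ins, and a production launch would add an auto-wrap policy plus tensor/pipeline/expert parallelism on top:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (FullyShardedDataParallel as FSDP,
                                    MixedPrecision, ShardingStrategy)

# Assumes a torchrun launch, which sets RANK / WORLD_SIZE / MASTER_ADDR.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=12).cuda()  # stand-in network

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16,
                                   buffer_dtype=torch.bfloat16),
)
```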
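FlashAttention-3 and Transformer Engine FP8 require their own libraries, but the underlying pattern is visible in stock PyTorch: half-precision activations fed to scaled_dot_product_attention, which dispatches to a fused flash-style kernel when the hardware supports one. A sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in bf16 to cut memory and bandwidth.
q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused kernel path avoids materializing the full seq x seq attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```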
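For the telemetry feeding those dashboards: the global gradient norm comes free from the clipping call, and a rolling-median detector is one simple way to flag loss spikes. Window and threshold here are illustrative, and real runs would stream these values to W&B or TensorBoard:

```python
import random
from collections import deque
from statistics import median


class SpikeDetector:
    """Flags a step whose loss exceeds k times the rolling median."""

    def __init__(self, window: int = 200, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def update(self, loss: float) -> bool:
        spiked = len(self.history) >= 20 and loss > self.k * median(self.history)
        self.history.append(loss)
        return spiked


detector = SpikeDetector()
for step in range(500):
    loss = random.gauss(2.0, 0.05) + (30.0 if step == 300 else 0.0)  # synthetic spike
    # In a real loop: grad_norm = torch.nn.utils.clip_grad_norm_(params, 1.0)
    if detector.update(loss):
        print(f"loss spike at step {step}: {loss:.2f}")
```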
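Elastic restart ultimately reduces to frequent, atomic checkpoints plus an idempotent resume path. A single-process sketch, assuming PyTorch; a distributed job would use FSDP's sharded state-dict APIs and also persist data-loader and RNG state:

```python
import os
import tempfile

import torch


def save_checkpoint(path: str, model, optimizer, step: int) -> None:
    # Write to a temp file, then rename: a crash never leaves a torn checkpoint.
    state = {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        torch.save(state, f)
    os.replace(tmp, path)  # atomic on the same filesystem


def load_checkpoint(path: str, model, optimizer) -> int:
    if not os.path.exists(path):
        return 0  # fresh start
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume after the last completed step
```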