FirstPrinciples is a non-profit organization dedicated to building an autonomous AI Physicist to advance our understanding of fundamental physics. The Member of Technical Staff, Training Engineer will develop and lead the pre-training of large language models, making critical modeling choices and ensuring that data pipelines and distributed training processes run effectively.
Responsibilities:
- Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs
- Tune optimizer configurations (AdamW/Adafactor/Sophia variants), learning-rate schedules with warmup, dropout, gradient clipping, weight decay, EMA, and activation checkpointing to keep long runs stable and memory-efficient at scale (see the LR-schedule sketch after this list)
- Own model and training recipes end-to-end, making informed decisions about microbatch and global batch configurations (see the gradient-accumulation sketch below)
- Run ablations and scaling-law studies to set tokens-to-train targets, entropy/perplexity goals, and checkpoint cadence that optimize the cost-to-quality trade-off (see the compute-budget sketch below)
- Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control (see the MinHash dedup sketch below)
- Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling (e.g., D4-style approaches)
- Architect comprehensive data pipelines across diverse modalities (web/book/code/speech/vision) with filtering, heuristic and learned scoring, temperature sampling, multilingual balancing, and curriculum learning (see the temperature-sampling sketch below)
- Demonstrate measurable impact from data quality work including large-scale deduplication, contamination audits, and repeat/mixture scheduling that improves downstream accuracy
- Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and NCCL collectives over high-speed interconnects (NVLink/InfiniBand)
- Choose and configure distributed strategies (FSDP vs. ZeRO; 3D/5D hybrid parallelism for MoE) and launch parameters, documenting trade-offs for future reference (see the FSDP sketch below)
- Exploit modern kernels and mixed-precision training (FlashAttention-3, FP8 via NVIDIA Transformer Engine) to maximize tokens/sec while maintaining perplexity targets (see the attention sketch below)
- Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels without compromising convergence
- Debug complex distributed training issues including deadlocks, OOMs, divergence, and stragglers using tools like Nsight, py-spy, TensorBoard, and W&B
- Build comprehensive observability for long-horizon runs, tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, and data freshness, and feeding evaluation dashboards (see the telemetry sketch below)
- Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs and clock-skew issues, and implement elastic restart mechanisms (see the checkpoint/restart sketch below)
- Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins
- Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs
- Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations, ensuring pre-training choices support downstream alignment objectives
- Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services
- Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques
- Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge
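Illustrative sketches:

A minimal sketch of the warmup-plus-cosine learning-rate schedule referenced in the optimizer-tuning bullet, assuming PyTorch; the peak LR, warmup length, step count, and model are illustrative placeholders, not production values:

```python
import math

import torch


def lr_lambda(step: int, warmup_steps: int = 2000,
              total_steps: int = 100_000, min_ratio: float = 0.1) -> float:
    """Linear warmup to the peak LR, then cosine decay to min_ratio * peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = torch.nn.Linear(1024, 1024)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.
```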
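For the microbatch/global-batch bullet: gradient accumulation is the standard way to reach a large global batch with a small per-device microbatch. A single-rank sketch with synthetic data; all sizes are illustrative:

```python
import torch

GLOBAL_BATCH = 1024  # sequences per optimizer step across the whole job
MICRO_BATCH = 8      # sequences that fit on one device per forward pass
WORLD_SIZE = 16      # data-parallel ranks (enters only the arithmetic here)
ACCUM_STEPS = GLOBAL_BATCH // (MICRO_BATCH * WORLD_SIZE)

model = torch.nn.Linear(512, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = ((torch.randn(MICRO_BATCH, 512), torch.randn(MICRO_BATCH, 1))
          for _ in range(ACCUM_STEPS * 10))  # stand-in data stream

for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y) / ACCUM_STEPS
    loss.backward()  # gradients accumulate across microbatches
    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```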
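For tokens-to-train targets, a common starting point before fitting in-house scaling laws is the Chinchilla heuristic of roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6ND (Hoffmann et al., 2022). A back-of-envelope sketch with illustrative model sizes:

```python
def chinchilla_budget(n_params: float) -> tuple:
    """Return (tokens, FLOPs) under D ≈ 20 N and C ≈ 6 N D."""
    tokens = 20.0 * n_params
    flops = 6.0 * n_params * tokens
    return tokens, flops


for n in (7e9, 70e9):  # illustrative model sizes
    tokens, flops = chinchilla_budget(n)
    print(f"{n / 1e9:.0f}B params -> ~{tokens / 1e9:,.0f}B tokens, ~{flops:.2e} FLOPs")
```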
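For deduplication, a toy sketch of MinHash over word shingles, the family of techniques behind large-scale near-duplicate detection. This is pure Python for clarity; production pipelines add LSH banding and run over sharded corpora:

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set: set, num_perm: int = 64) -> list:
    # Simulate num_perm independent hash functions via a per-function seed prefix.
    return [
        min(int.from_bytes(hashlib.blake2b(seed.to_bytes(4, "big") + s.encode(),
                                           digest_size=8).digest(), "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]


def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the training corpus is filtered deduplicated and packed by length before being streamed to the cluster each epoch"
doc_b = doc_a + " nightly"  # near-duplicate differing by one trailing word
sim = jaccard_estimate(minhash_signature(shingles(doc_a)),
                       minhash_signature(shingles(doc_b)))
print(sim)  # high estimate; pairs above a threshold (often ~0.8) are collapsed
```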
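For multilingual balancing, temperature (exponent) sampling flattens size-proportional source probabilities so low-resource languages are seen more often, following the multilingual pre-training literature; conventions differ on whether the exponent is written as α or 1/T. A sketch with made-up corpus sizes:

```python
def temperature_weights(token_counts: dict, temperature: float = 0.7) -> dict:
    """Sampling probabilities p_i proportional to (n_i / N) ** temperature.

    temperature = 1.0 reproduces size-proportional sampling; lower values
    flatten the mixture toward low-resource sources.
    """
    total = sum(token_counts.values())
    raw = {k: (v / total) ** temperature for k, v in token_counts.items()}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}


# Illustrative corpus sizes in tokens, not real statistics.
mix = temperature_weights({"en": 2e12, "de": 2e11, "sw": 5e9}, temperature=0.7)
print(mix)  # "sw" rises from ~0.2% of raw tokens to ~1.2% of samples
```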
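A minimal sketch of wrapping a model in PyTorch FSDP with full (ZeRO-3-style) sharding and bf16 mixed precision; the network and its sizes are stand-ins, and a production launch would add an auto-wrap policy plus tensor/pipeline/expert parallelism on top:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (FullyShardedDataParallel as FSDP,
                                    MixedPrecision, ShardingStrategy)

# Assumes a torchrun launch, which sets RANK / WORLD_SIZE / MASTER_ADDR.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=12).cuda()  # stand-in network

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16,
                                   buffer_dtype=torch.bfloat16),
)
```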
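FlashAttention-3 and Transformer Engine FP8 require their own libraries, but the underlying pattern is visible in stock PyTorch: half-precision activations fed to scaled_dot_product_attention, which dispatches to a fused flash-style kernel when the hardware supports one. A sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in bf16 to cut memory and bandwidth.
q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused kernel path avoids materializing the full seq x seq attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```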
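For the telemetry feeding those dashboards: the global gradient norm comes free from the clipping call, and a rolling-median detector is one simple way to flag loss spikes. Window and threshold here are illustrative, and real runs would stream these values to W&B or TensorBoard:

```python
import random
from collections import deque
from statistics import median


class SpikeDetector:
    """Flags a step whose loss exceeds k times the rolling median."""

    def __init__(self, window: int = 200, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def update(self, loss: float) -> bool:
        spiked = len(self.history) >= 20 and loss > self.k * median(self.history)
        self.history.append(loss)
        return spiked


detector = SpikeDetector()
for step in range(500):
    loss = random.gauss(2.0, 0.05) + (30.0 if step == 300 else 0.0)  # synthetic spike
    # In a real loop: grad_norm = torch.nn.utils.clip_grad_norm_(params, 1.0)
    if detector.update(loss):
        print(f"loss spike at step {step}: {loss:.2f}")
```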
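Elastic restart ultimately reduces to frequent, atomic checkpoints plus an idempotent resume path. A single-process sketch, assuming PyTorch; a distributed job would use FSDP's sharded state-dict APIs and also persist data-loader and RNG state:

```python
import os
import tempfile

import torch


def save_checkpoint(path: str, model, optimizer, step: int) -> None:
    # Write to a temp file, then rename: a crash never leaves a torn checkpoint.
    state = {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        torch.save(state, f)
    os.replace(tmp, path)  # atomic on the same filesystem


def load_checkpoint(path: str, model, optimizer) -> int:
    if not os.path.exists(path):
        return 0  # fresh start
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume after the last completed step
```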