PhysicsX is a deep-tech company focused on accelerating hardware innovation through AI-driven simulation software. The Principal ML Infrastructure Engineer will be responsible for extending and operating the infrastructure that supports research model training and serving pipelines, collaborating closely with ML engineers and research scientists.
Responsibilities:
- Design and operate distributed training infrastructure for neural operator architectures, such as Transolver and Point Cloud Transformer, on our large NVIDIA DGX B200 platform
- Optimize training pipelines for throughput, fault tolerance, and cost efficiency, including checkpointing strategies, gradient accumulation, and multi-node synchronization
- Build and maintain experiment tracking and observability systems that give researchers clear visibility into training runs, hyperparameter sweeps, and model performance
- Solve data loading bottlenecks for large-scale mesh datasets
- Optimize data pipelines for efficient I/O from cloud storage, including prefetching, caching, and format optimization
- Work with heterogeneous data sources of varying formats and resolutions
- Build serving infrastructure for pre-trained Large Physics Models (LPMs), supporting both zero-shot inference and uncertainty quantification (Monte Carlo Dropout)
- Design and implement model packaging pipelines for customer deployment, so that models run reliably in customer environments and can be fine-tuned
- Ensure reproducibility: any model checkpoint should be deployable with consistent behaviour
- Improve developer experience for the Research team through fast iteration cycles, reliable CI/CD, and clear debugging tools
- Collaborate with the broader Infrastructure team on shared patterns and standards