PhysicsX is a deep-tech company focused on accelerating hardware innovation through AI-driven simulation software. The Principal ML Infrastructure Engineer will be responsible for extending and operating the infrastructure that supports research model training and serving pipelines, collaborating closely with ML engineers and research scientists.

Responsibilities:

Design and operate distributed training infrastructure for neural operator architectures (Transolver, Point Cloud Transformer, etc.) on our large NVIDIA DGX B200 platform
Optimize training pipelines for throughput, fault tolerance, and cost efficiency, including checkpointing strategies, gradient accumulation, and multi-node synchronization
Build and maintain experiment tracking and observability systems that give researchers clear visibility into training runs, hyperparameter sweeps, and model performance
Solve data loading bottlenecks for large-scale mesh datasets
Optimize data pipelines for efficient I/O from cloud storage, including prefetching, caching, and format optimization
Work with heterogeneous data sources of varying formats and resolutions
Build serving infrastructure for pre-trained LPMs, supporting both zero-shot inference and uncertainty quantification (Monte Carlo Dropout)
Design and implement model packaging pipelines for customer deployment, so that models run reliably in customer environments and can be fine-tuned there
Ensure reproducibility: any model checkpoint should be deployable with consistent behaviour
Improve developer experience for the Research team with fast iteration cycles, reliable CI/CD, and clear debugging tools
Collaborate with the broader Infrastructure team on shared patterns and standards

Principal Machine Learning Infrastructure Engineer
