Initialized Capital is focused on building quantum-accelerated AI servers to enhance AI training and inference. The ML Infrastructure Engineer will build and manage the compute platform for the AI & Algorithms team, ensuring reliable GPU access and scalable workloads across multiple cloud providers and on-premise servers.
Responsibilities:
- Build compute abstractions that handle the team's diverse workloads (GPU-accelerated simulation, distributed training, high-throughput CPU jobs, and interactive analysis) across PyTorch, JAX, and scientific computing frameworks
- Stand up experiment tracking and reproducibility infrastructure
- Create developer tooling that makes cloud compute feel local: environment setup, job submission, monitoring, and artifact management
- Scale experiments from single-GPU prototyping to multi-node production runs
- Design multi-provider workload orchestration: route jobs based on cost, availability, and capability
- Manage and optimize spend across cloud providers: track credit balances, burn rates, and expiration dates
- Configure hybrid local + cloud workflows as on-prem GPU infrastructure comes online
- Coordinate with our infrastructure engineer on cloud administration and security
- Build CI/CD pipelines for research workloads: automated testing, evaluation benchmarks, artifact management
- Create data generation and preprocessing pipelines at the throughput the team's simulators demand
- Set up monitoring, alerting, and cost dashboards that surface problems before researchers hit them
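
To give a flavor of the multi-provider orchestration work, here is a minimal sketch of routing a job based on capability, availability, and cost. It is illustrative only and not a description of our actual platform: the `Provider`, `Job`, `route`, and `estimated_cost` names, the provider labels, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Provider:
    """Hypothetical snapshot of one compute provider's state."""
    name: str
    gpu_types: frozenset      # capability: GPU models this provider offers
    free_gpus: int            # availability: GPUs free right now
    usd_per_gpu_hour: float   # cost: made-up price per GPU-hour


@dataclass(frozen=True)
class Job:
    """Hypothetical resource request for one workload."""
    gpu_type: str
    gpus: int
    hours: float


def route(job: Job, providers: Iterable[Provider]) -> Optional[Provider]:
    """Pick the cheapest provider that can run the job, or None if none can."""
    eligible = [
        p for p in providers
        if job.gpu_type in p.gpu_types   # capability filter
        and p.free_gpus >= job.gpus      # availability filter
    ]
    if not eligible:
        return None
    # Among eligible providers, choose the lowest price per GPU-hour.
    return min(eligible, key=lambda p: p.usd_per_gpu_hour)


def estimated_cost(job: Job, provider: Provider) -> float:
    """Estimated spend in USD if the job runs for its full requested time."""
    return job.gpus * job.hours * provider.usd_per_gpu_hour


# Example inventory; every value here is invented for illustration.
PROVIDERS = [
    Provider("cloud-a", frozenset({"a100", "h100"}), free_gpus=8, usd_per_gpu_hour=2.5),
    Provider("cloud-b", frozenset({"a100"}), free_gpus=64, usd_per_gpu_hour=1.5),
    Provider("on-prem", frozenset({"a100"}), free_gpus=4, usd_per_gpu_hour=0.5),
]

if __name__ == "__main__":
    job = Job(gpu_type="a100", gpus=8, hours=10.0)
    choice = route(job, PROVIDERS)
    if choice is None:
        print("no provider can run this job right now")
    else:
        print(choice.name, estimated_cost(job, choice))
```

A real router would also need to weigh credit expiration, data locality, queue times, and preemption risk. Filtering first and ranking by price second is simply the easiest starting point to reason about.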