Pluralis Research is pioneering Protocol Learning, a decentralized method for training and deploying AI models. They are seeking an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure for their decentralized ML training platform, focusing on core systems for infrastructure orchestration and distributed compute.
Responsibilities:
- Design resource management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform)
- Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes
- Architect fault-tolerant infrastructure for distributed ML
- Build systems that simulate and handle real-world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity
Requirements:
- 5+ years of professional experience, with deep expertise in infrastructure and platform engineering
- Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments
- Experience with lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale
- Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, and long-running job orchestration
- Familiarity with decentralized networking (P2P, NAT traversal, traffic shaping) and real-world bandwidth constraints
- Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling)
- Hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response
- Experience in a startup environment with an emphasis on microservices orchestration, or a big tech background
- Deep understanding of multi-cloud infrastructure and distributed training systems
- A team player with high attention to detail
- A strong passion for joining the team