Pluralis Research is pioneering Protocol Learning, a decentralized method for training and deploying AI models. They are seeking an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure for their decentralized ML training platform, focusing on core systems for infrastructure orchestration and distributed compute.

Responsibilities:

Design resource management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform)
Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes
Architect fault-tolerant infrastructure for distributed ML
Build systems that simulate and handle real-world network conditions — bandwidth shaping, latency injection, packet loss — while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity

Requirements:

5+ years of professional experience, with deep expertise in infrastructure and platform engineering
Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments
Experience with lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale
Deep understanding of distributed training workflows: checkpointing, data sharding, model versioning, and long-running job orchestration
Experience with decentralized networking (P2P, NAT traversal, traffic shaping) and real-world bandwidth constraints
Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling)
Hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response
Experience in a startup environment with an emphasis on microservices orchestration, or a big-tech background
Deep understanding of multi-cloud infrastructure and distributed training systems
A team player with high attention to detail
A strong passion for joining the team

Machine Learning Engineer - ML Training Platform
