Opentensor Foundation is building the infrastructure for decentralized AI at internet scale. The foundation is seeking a Research Engineer with expertise in distributed training to design scalable training systems that harness compute from globally distributed participants.
Responsibilities:
- Drive research focused on building a large-scale, secure, and reliable system for orchestrating decentralized AI model training
- Continuously improve AI workload efficiency by applying state-of-the-art compute and memory optimization techniques
- Contribute to shaping our open-source tools and libraries that support scalable, distributed training of machine learning models
- Share breakthroughs and research findings with the broader community through publications at premier conferences such as NeurIPS
- Monitor emerging trends in ML infrastructure, decentralized compute, and tooling, proactively identifying ways to improve the platform's performance and developer experience
Requirements:
- Proven expertise in AI and machine learning engineering, with a track record of building and scaling end-to-end pipelines for training and deploying large-scale AI models
- In-depth knowledge of distributed training methodologies and frameworks such as PyTorch Distributed, DeepSpeed, Ray, and MosaicML's LLM Foundry, with a focus on enhancing training efficiency and system scalability
- Hands-on experience training large models at scale using advanced parallelism strategies, including data, tensor, and pipeline parallelism
- Strong grasp of modern MLOps workflows, including model lifecycle management, experiment tracking, and CI/CD automation
- Deep motivation to push the boundaries of decentralized AI training and make cutting-edge AI technology more accessible to a global community of developers and researchers