Opentensor Foundation is building the infrastructure for decentralized AI at internet scale. The foundation is seeking a Research Engineer with expertise in distributed training to design scalable training systems that harness compute from globally distributed participants.
Responsibilities:
- Drive research focused on building a large-scale, secure, and reliable system for orchestrating decentralized AI model training
- Continuously improve AI workload efficiency by applying state-of-the-art compute and memory optimization techniques
- Contribute to shaping our open-source tools and libraries that support scalable, distributed training of machine learning models
- Share breakthroughs and research findings with the broader community through publications at premier conferences such as NeurIPS
- Monitor emerging trends in ML infrastructure, decentralized compute, and tooling, proactively identifying ways to improve the platform's performance and developer experience
Requirements:
- Proven expertise in AI and machine learning engineering, with a track record of building and scaling end-to-end pipelines for training and deploying large-scale AI models
- In-depth knowledge of distributed training methodologies and frameworks such as PyTorch Distributed, DeepSpeed, Ray, and MosaicML's LLM Foundry, with a focus on enhancing training efficiency and system scalability
- Hands-on experience training large models at scale using advanced parallelism strategies, including data, tensor, and pipeline parallelism
- Strong grasp of modern MLOps workflows, including model lifecycle management, experiment tracking, and CI/CD automation
- Deep motivation to push the boundaries of decentralized AI training and make cutting-edge AI technology more accessible to a global community of developers and researchers