Snorkel AI is on a mission to help enterprises transform expert knowledge into specialized AI at scale. As an Applied Research Engineer, you will own the infrastructure that powers model training and evaluation, building and operating GPU clusters and training pipelines while collaborating closely with research scientists and engineers.
Responsibilities:
- Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking
- Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters
- Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale
- Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration
- Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures
- Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workload needs change
Requirements:
- Hands-on experience managing GPU clusters on major cloud providers, including provisioning, network configuration, and cost management
- Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent cluster management systems
- Working knowledge of distributed training concepts: parallelism strategies, memory optimization techniques, and inter-node communication
- Experience setting up, managing, and integrating ML experiment tracking and data/model versioning tools
- Strong Python proficiency and solid software engineering fundamentals such as version control, modular design, and automation
- Ability to work in a fast-moving, iterative environment and take end-to-end ownership of ambiguous infrastructure problems
- Hands-on experience with post-training workflows such as supervised fine-tuning (SFT) or reinforcement learning (RLHF, GRPO, or similar) is a strong plus, but not required