Reddit is a community-driven platform focused on authentic conversations and shared interests. The company is seeking a Senior Machine Learning Engineer to architect and maintain foundational ML infrastructure that supports various teams and enhances the user experience through advanced machine learning techniques.

Responsibilities:

Lead the building, testing, and maintenance of ML training infrastructure at Reddit
Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the 'Idea-to-Prototype' loop, and standardize software environments (Docker images, Python dependency management)

Requirements:

5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems
Deep Kubernetes Expertise: You know K8s beyond just 'deploying pods.' You understand CRDs, Controllers and the Operator pattern
Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms
Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling)
GPU Experience: Hands-on practice with CUDA environments and GPU virtualization/containerization, all within Kubernetes
Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP
Experience working with distributed training frameworks, including Ray and Kubernetes
Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems
Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle
Strong organizational & communication skills

Senior Machine Learning Engineer, ML Training Platform
