Reddit is a community of communities, built on shared interests and trust, and is home to the most authentic conversations on the internet. As a Senior Machine Learning Engineer on the ML Training Platform team, you will architect and maintain foundational Machine Learning infrastructure that powers various features, enabling continuous improvement of ML systems.
Responsibilities:
- Lead the building, testing, and maintenance of ML training infrastructure at Reddit
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Evolve the machine learning engineer (MLE) experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
- Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
- GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
- Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the 'Idea-to-Prototype' loop, and standardize software environments (Docker images, Python dependency management)
Requirements:
- 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems
- Deep Kubernetes Expertise: You know K8s beyond just 'deploying pods.' You understand CRDs, Controllers, and the Operator pattern
- Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms
- Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling)
- GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all within Kubernetes
- Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP
- Experience working with distributed training frameworks such as Ray, running on Kubernetes
- Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems
- Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle
- Strong organizational and communication skills