Reddit is a community-driven platform that facilitates open conversations among its users. The Senior Machine Learning Engineer will be responsible for architecting and maintaining the machine learning infrastructure that supports functionality such as content recommendation and content understanding, contributing significantly to Reddit's mission.
Responsibilities:
- Lead the building, testing, and maintenance of ML training infrastructure at Reddit
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
- Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
- GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
- Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the 'Idea-to-Prototype' loop, and standardize software environments (Docker images, Python dependency management)
Requirements:
- 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems
- Deep Kubernetes Expertise: You know K8s beyond just 'deploying pods.' You understand CRDs, Controllers and the Operator pattern
- Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms
- Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling)
- GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all running within Kubernetes
- Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP
- Experience working with distributed training frameworks such as Ray, and running them on Kubernetes
- Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems
- Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle
- Strong organizational & communication skills