Lead the building, testing, and maintenance of ML training infrastructure at Reddit.
Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.
Evolve the machine learning engineer (MLE) experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows.
Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.
GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.
Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management).
Requirements
5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems.
Deep Kubernetes Expertise: You know K8s beyond just "deploying pods." You understand CRDs, Controllers, and the Operator pattern.
Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms.
Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling).
GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all running within Kubernetes.
Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP.
Experience working with distributed training frameworks such as Ray, running on Kubernetes.
Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems.
Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
Strong organizational & communication skills.
Tech Stack
AWS
Cloud
Distributed Systems
Docker
Google Cloud Platform
Kubernetes
Python
Ray
Go
Benefits
Comprehensive Healthcare Benefits and Income Replacement Programs