Reddit is a community of communities and one of the internet’s largest sources of information. The company is seeking a Senior Machine Learning Engineer to architect, implement, and maintain foundational machine learning (ML) infrastructure that supports teams and functions across the company.
Responsibilities:
- Lead the building, testing, and maintenance of ML training infrastructure at Reddit
- Design, build, and optimize the infrastructure and tooling required to support large-scale machine learning workflows
- Evolve the machine learning engineer (MLE) experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
- Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
- GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
- Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management)