Reddit is a community-driven platform known for its open and authentic conversations. The Senior Staff Machine Learning Engineer will lead the vision for Reddit’s large-scale GenAI Platform, focusing on the strategy, architecture, and operational model to support generative AI products across the company.
Responsibilities:
- Lead and execute the vision, strategy, and roadmap for Reddit’s large-scale GenAI Platform
- Define the platform architecture and operating model that enable teams to build, deploy, and scale GenAI products reliably
- Drive the strategy for a unified LLM Gateway supporting internally and externally hosted LLMs through consistent APIs and abstractions
- Set the direction for core platform capabilities such as rate and token limit management, intelligent failover, and production resilience
- Shape Reddit’s approach to an enterprise-grade RAG system
- Establish the strategic direction for agentic AI workflows and tool-use patterns across the platform
- Own the end-to-end platform strategy from concept through production adoption and long-term evolution
- Drive MLOps and LLMOps standards across CI/CD, testing, versioning, evaluation, and lifecycle management
- Define best practices for observability, monitoring, governance, and operational excellence across GenAI systems
- Partner across engineering, product, and leadership to align platform investments with company priorities and user needs
- Champion platform thinking with a strong focus on scalability, reliability, performance, and developer experience
- Influence technical direction across teams by turning emerging AI capabilities into a scalable platform strategy
Requirements:
- 10+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles
- Have a track record of leading technical strategy and delivering AI platforms in cloud-based production environments at scale
- Demonstrate strong execution by turning strategy into action, driving complex initiatives end to end, and consistently delivering high-quality platform outcomes
- Bring deep experience operating Kubernetes and other orchestration systems in large-scale production environments
- Deep experience with cloud technologies that support an ML platform, such as AWS, Google Cloud Storage, and infrastructure-as-code tools (Terraform)
- Proficiency with programming languages and frameworks commonly used in ML, such as Go and Python
- Excellent communication skills with the ability to articulate technical AI concepts to non-technical stakeholders
- Strong focus on scalability, reliability, performance, and developer experience. You are an undying advocate for platform users and have a deep intuition for the GenAI product development lifecycle
- Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems is a plus