Reddit is a community-driven platform known for its open conversations and diverse interests. It is seeking a Senior Staff ML Infra Engineer to enhance its ML-powered ad ranking systems through improved infrastructure and automation. The role involves architecting ML training and serving systems, driving system reliability, and mentoring engineers across teams.
Responsibilities:
- Architect and significantly influence ML training/serving systems and tooling that unlock faster iteration and larger models
- Drive adoption and reliability of ML systems and own a portfolio of initiatives across multiple teams
- Improve efficiency: GPU utilization, training runtime, data loading, feature performance, and serving latency
- Write high-quality design docs; run design reviews; set standards for correctness, reliability, and velocity
- Partner cross-org on shared infra sequencing, requirements, and adoption
- Mentor other engineers and contribute to engineering best practices across the org
Requirements:
- 9+ years of industry experience in Software / ML Infra Engineering
- Strong systems background: distributed systems, data pipelines, service design, performance tuning, and operational excellence
- Demonstrated experience improving ML training/serving reliability at scale
- Ability to reason deeply about infra tradeoffs that affect modeling velocity and product outcomes
- Ability to lead through influence: drive adoption across teams and build durable interfaces/standards
- Demonstrated impact: leading cross-team technical initiatives and influencing technical direction
- Excellent communication: can explain tradeoffs clearly to both technical and non-technical stakeholders
- Comfortable mentoring other engineers and raising engineering quality through reviews and technical guidance