Sprout Social helps businesses worldwide harness the power and opportunity of social media. The company is seeking a Senior ML Ops Engineer to build and maintain infrastructure for AI/ML at scale, manage the lifecycle of machine learning models, and support AI/ML scientists in model development and deployment.
Responsibilities:
- Build and maintain infrastructure using AWS, Terraform, and Kubernetes to support AI/ML at scale, including Generative AI applications
- Manage the end-to-end lifecycle of machine learning models, ensuring that observability and tooling support both scale and speed
- Execute at scale while staying nimble enough to keep up with new capabilities offered by social network APIs
- Improve processes and champion ideas that matter while holding the team accountable to high code quality and engineering standards
- Support our AI/ML scientists by developing tooling to streamline model development and deployment
Requirements:
- 5+ years of experience developing and supporting AI/ML software in a production environment
- 5+ years of experience programming in object-oriented languages such as Java, Python, or C++
- Impact-oriented mindset with an interest in stability at scale and a willingness to engage in feature development
- 3+ years of experience developing and supporting scalable, distributed backend services
- 3+ years of experience building and supporting GPU-heavy services
- 1+ years of experience with LLMs / Generative AI, including managing their unique costs, constraints, and observability challenges
- 1+ years of experience with Infrastructure-as-Code (Terraform) and container orchestration (Kubernetes) within AWS environments