Together AI is a research-driven artificial intelligence company focused on creating innovative AI systems. As an AI Infrastructure Engineer, you will ensure the smooth operation of user-facing services and production systems while implementing best practices for reliability and scalability.
Responsibilities:
- Participate in the on-call rotation (PagerDuty) to respond to production incidents
- Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
- Build monitoring systems to ensure the highest quality service for our customers
- Design and implement operational processes (such as deployments and upgrades)
- Debug production issues across all services and levels of the stack
- Identify improvements to the product architecture from reliability, performance, and availability perspectives
- Plan the growth of Together AI's infrastructure