About this role

Together AI is a research-driven artificial intelligence company. They are seeking a Site Reliability Engineer to ensure user-facing services and production systems run smoothly while applying engineering principles and operational discipline.

Responsibilities:

Participate in the on-call rotation (PagerDuty) to respond to production incidents
Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
Build monitoring systems to ensure the highest quality service for our customers
Design and implement operational processes (such as deployments and upgrades)
Debug production issues across all services and levels of the stack
Identify improvements for the product architecture from the reliability, performance and availability perspectives
Plan the growth of Together AI’s infrastructure

Requirements:

2+ years of professional SRE or related experience
Bachelor's degree in Computer Science or a related field, or equivalent work experience
Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
Proficiency in programming/scripting languages
Direct experience in monitoring and observability practices
Knowledge of cloud services
Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts
