Baseten powers mission-critical inference for dynamic AI companies. As a Site Reliability Engineer, you will build robust systems and processes that keep the infrastructure scalable, reliable, and efficient, while collaborating closely with users to improve the platform.

Responsibilities:

Build and maintain scalable infrastructure to support the deployment and operation of machine learning models
Establish standards and best practices for reliability and performance across the infrastructure
Automate processes where it is worthwhile, particularly the management of CI/CD pipelines
Own products and projects end-to-end, functioning as both an engineer and a project manager, with a focus on user empathy, project specification, and end-to-end execution
Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions
Mentor junior team members and contribute to knowledge sharing within the organization
Navigate ambiguity and exercise good judgment on tradeoffs and tools needed to solve problems, avoiding unnecessary complexity
Demonstrate pride, ownership, and accountability for your work, and expect the same from your teammates

Requirements:

Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
Extensive experience with Kubernetes
Experience in building and maintaining scalable infrastructure
Experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins)
Ability to own projects end-to-end, from project specification to execution
No prior machine learning experience is required, but you should be open to learning about it
Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus
