Baseten powers mission-critical inference for dynamic AI companies. As a Site Reliability Engineer, you will build robust systems and processes that keep the infrastructure scalable, reliable, and efficient, while collaborating closely with users to improve the platform.
Responsibilities:
- Build and maintain scalable infrastructure to support the deployment and operation of machine learning models
- Establish standards and best practices for reliability and performance across the infrastructure
- Automate processes when relevant, particularly for managing CI/CD pipelines
- Own products and projects end-to-end, acting as both engineer and project manager, with a focus on user empathy, project specification, and execution
- Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions
- Mentor junior team members and contribute to knowledge sharing within the organization
- Navigate ambiguity and exercise good judgment on tradeoffs and tools needed to solve problems, avoiding unnecessary complexity
- Demonstrate pride, ownership, and accountability in your work, and expect the same from your teammates
Requirements:
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
- Extensive experience with Kubernetes
- Experience in building and maintaining scalable infrastructure
- Experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins)
- Ability to own projects end-to-end, from project specification to execution
- No prior machine learning experience is required, but you should be open to learning about it
- Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus