Design, implement, and maintain reliable, scalable, and secure infrastructure, applications, and tooling, with a focus on our ML/AI pipelines and workloads
Write clean, maintainable code and perform peer code reviews
Write clear and concise documentation and engage in cross-team communication and knowledge sharing
Work with other team members to investigate design approaches, prototype new technology and evaluate technical feasibility
Pair with adjacent teams to understand how your frameworks and infrastructure are actually used in the field, continuously improving them and leveraging recent advances to improve developer velocity
Requirements
At least 5 years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
B.S. or M.S. degree (or equivalent) in Computer Science, Engineering, or a related field
Proficiency in at least one modern programming language (Python, JavaScript, C#, Go, etc.)
Demonstrated experience applying industry best practices in MLOps
Proficiency with CI/CD tools and GitOps workflows
Familiarity with running GPU workloads in Kubernetes
Strong knowledge of Kubernetes (self-hosted and managed) and modern Kubernetes paradigms (e.g., the CNCF ecosystem)
Proficiency with Infrastructure as Code tools (Terraform, etc.) and configuration management tools (Ansible, etc.)
Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry)