Role Overview
- Design, develop, and maintain infrastructure for AI inference workloads, including GPU scheduling, model deployment pipelines, and data access patterns in on-prem environments
- Build and manage monitoring and observability tools for AI inference platforms, including dashboards, alerts, and runbooks for model health and system performance
- Collaborate with ML engineers and platform teams to design system architecture for AI workloads, integrate inference runtimes, and test performance at scale
Requirements
- Hands-on experience in containerization and container orchestration: Kubernetes, Helm, Docker/CRI-O
- Solid grounding in Linux administration and networking
- Programming and Scripting: Python/Go/Bash
- Infrastructure as Code (IaC) approach: Ansible, Terraform
- Creating CI/CD pipelines: GitLab CI/GitHub Actions
- Experience with Cluster API or another "Kubeception" technology
- Deep experience with Kubernetes CNI, CSI, and Operators
Nice to Have
- Knowledge of Kubernetes-related technologies such as ArgoCD, Helmfile
- Experience with the Prometheus stack
- Experience with other Cloud Native technologies
Tech Stack
- Ansible
- Cloud
- Docker
- Kubernetes
- Linux
- Prometheus
- Python
- Terraform
- Go
Benefits
- Competitive compensation
- Flexible working hours and hybrid or remote options, depending on the role
- Work from anywhere in the world for up to 45 days per year
- Private medical insurance for you and your family*
- Extra paid vacation and sick leave days*
- Support for life’s important moments and celebrations
- Language courses to help you connect and grow
- Modern, welcoming offices with snacks, drinks, and entertainment*
- Team sports and social activities*
*Benefits may vary depending on your location.