Vero is an exciting AI infrastructure startup partnering with NVIDIA to shape the future of data centers. As a DevOps Engineer, you will design, deploy and manage pipelines and infrastructure for large-scale NVIDIA GPU systems that support AI/ML workloads and HPC clusters.
Responsibilities:
- Design, deploy and manage DevOps pipelines supporting large-scale GPU infrastructure and distributed AI/ML and HPC workloads
- Automate provisioning, monitoring and maintenance of high-performance GPU environments
- Optimize performance, stability and resource utilization across Kubernetes clusters in a liquid-cooled data center environment
- Work closely with infrastructure, hardware and software teams to integrate compute, networking and cooling systems
- Monitor system health and build tooling to track temperature, pressure and power usage across high-density environments
- Troubleshoot deployment, scaling and performance issues across distributed GPU systems
- Implement CI/CD pipelines, infrastructure-as-code and security best practices to support reliable deployments at scale
- Develop automation and tooling using Python and Bash
Requirements:
- 5+ years of experience in DevOps, SRE, platform engineering or infrastructure roles
- 2+ years working with GPU infrastructure, HPC clusters or other high-performance compute environments
- Experience with Kubernetes and Slurm
- Experience with Terraform, Ansible, Bash or Python
- Experience with CI/CD pipelines for large-scale distributed systems
- Experience with monitoring and telemetry tools such as Prometheus, Grafana or Redfish
- Comfortable operating in high-availability, uptime-critical environments
- Certifications such as CKA, DevOps Engineer or HashiCorp Terraform Associate