Vero is an exciting AI infrastructure startup partnering with NVIDIA to shape the future of data centers. The role of DevOps Engineer involves designing, deploying, and managing pipelines and infrastructure for large-scale NVIDIA GPU systems supporting AI/ML workloads and HPC clusters.

Responsibilities:

Design, deploy and manage DevOps pipelines supporting large-scale GPU infrastructure and distributed AI/ML and HPC workloads
Automate provisioning, monitoring and maintenance of high-performance GPU environments
Optimize performance, stability and resource utilization across Kubernetes clusters in a liquid-cooled data center environment
Work closely with infrastructure, hardware and software teams to integrate compute, networking and cooling systems
Monitor system health and build tooling to track temperature, pressure and power usage across high-density environments
Troubleshoot deployment, scaling and performance issues across distributed GPU systems
Implement CI/CD pipelines, infrastructure-as-code and security best practices to support reliable deployments at scale
Develop automation and tooling using Python and Bash

Requirements:

5+ years of experience in DevOps, SRE, platform engineering or infrastructure roles
2+ years working with GPU infrastructure, HPC clusters or other high-performance compute environments
Experience with Kubernetes and Slurm
Experience with Terraform, Ansible, Bash or Python
Experience with CI/CD pipelines for large-scale distributed systems
Monitoring and telemetry tools and interfaces such as Prometheus, Grafana or Redfish
Comfortable operating in high-availability, uptime-critical environments
Certifications such as CKA, DevOps Engineer or HashiCorp Terraform Associate

DevOps Engineer (GPU)
