vCluster is a venture-backed tech startup pioneering Kubernetes virtualization for the AI era. As an AI Infrastructure Engineer, you will lead technical deployments and optimize infrastructure for GPU AI Clouds and enterprises, ensuring a seamless experience for customers.
Responsibilities:
- Lead Technical Deployments: Drive end-to-end technical deployments for GPU neocloud and AI Factory customers, from initial bare metal configuration to a validated vCluster environment
- Infrastructure Optimization: Configure and troubleshoot bare metal GPU node infrastructure, including CNI configuration, GPU Operator setup, distributed storage backends, and RDMA/InfiniBand
- Validation: Deploy and validate Kubernetes and vCluster to deliver GPU-powered managed Kubernetes to customers
- Knowledge Transfer: Work alongside customer teams to build self-sufficiency, ensuring they can operate and grow the platform independently
- Scaling through Documentation: Document reusable playbooks and deployment architectures so your learnings become the next customer's head start
- Feedback Loop: Collaborate with Engineering and Product to surface recurring infrastructure challenges, acting as a direct feedback loop from the field into the roadmap
- Strategic Partnering: Join Sales in the pre-sales process where deep infrastructure work is required to achieve a meaningful proof of value
Requirements:
- 5+ years of experience deploying and operating Kubernetes in production, ideally on bare metal or in high-complexity environments
- Practical knowledge of the NVIDIA GPU Operator, CUDA tooling, and systems-level configuration for GPU nodes
- Deep understanding of CNI plugins, overlay networks, and load balancing, with the ability to diagnose connectivity issues in layered environments
- Experience with persistent volume configuration, CSI drivers, and distributed storage systems like Ceph, Rook, Weka, or Longhorn
- Comfort operating in ambiguous, fast-moving environments where you are often writing the playbook in real time
- A preference for modern stacks over legacy tech, and for environments where you can solve a variety of problems, from pipelines to internal services
- Experience writing automation scripts with Bash, Python, or Go
- Relevant certifications such as CKA (Certified Kubernetes Administrator) or experience writing Kubernetes Operators
- Experience with inference serving, GPU scheduling, and the tooling around LLM deployment
- Experience applying AI automation to documentation and contributing to a shared knowledge base