vCluster is a venture-backed tech startup pioneering Kubernetes virtualization for the AI era. As an AI Infrastructure Engineer, you will lead technical deployments and optimize infrastructure for GPU AI Clouds and enterprises, ensuring a seamless experience for customers.
Responsibilities:
- Lead Technical Deployments: Drive end-to-end technical deployments for GPU neocloud and AI Factory customers, from initial bare metal configuration to a validated vCluster environment
- Infrastructure Optimization: Configure and troubleshoot bare metal GPU node infrastructure, including CNI configuration, GPU Operator setup, distributed storage backends, and RDMA/InfiniBand
- Validation: Deploy and validate Kubernetes and vCluster to deliver GPU-powered managed Kubernetes to customers
- Knowledge Transfer: Work alongside customer teams to build self-sufficiency, ensuring they can operate and grow the platform independently
- Scaling through Documentation: Document reusable playbooks and deployment architectures so your learnings become the next customer's head start
- Feedback Loop: Collaborate with Engineering and Product to surface recurring infrastructure challenges, acting as a direct feedback loop from the field into the roadmap
- Strategic Partnering: Join Sales in the pre-sales process where deep infrastructure work is required to achieve a meaningful proof of value
Requirements:
- 5+ years of experience deploying and operating Kubernetes in production, ideally on bare metal or in high-complexity environments
- Practical knowledge of the NVIDIA GPU Operator, CUDA tooling, and systems-level configuration for GPU nodes
- Deep understanding of CNI plugins, overlay networks, and load balancing, with the ability to diagnose connectivity issues in layered environments
- Experience with persistent volume configuration, CSI drivers, and distributed storage systems like Ceph, Rook, Weka, or Longhorn
- Comfort operating in ambiguous, fast-moving environments where you are often writing the playbook in real time
- A preference for modern stacks over legacy tech, and for environments where you can solve a variety of problems, from pipelines to internal services
- Experience writing automation scripts with Bash, Python, or Go
- Relevant certifications such as CKA (Certified Kubernetes Administrator) or experience writing Kubernetes Operators
- Experience with inference serving, GPU scheduling, and the tooling around LLM deployment
- Experience applying AI automation to documentation and contributing to a shared knowledge base