Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm.
Integrate NVIDIA Hopper‑ and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE.
Deploy and manage GPU Operator and Network Operator for large fleets.
Design and validate cloud‑native HPC environments with low latency and high bandwidth.
Define and document reference architectures for AI model training and MLOps.
Collaborate with NVIDIA and other partners to evaluate new GPU generations and software stacks.
Benchmark performance, track down bottlenecks, and recommend concrete changes.
Lead design sessions and architecture reviews with customers focused on performance and reliability.
Requirements
A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).
3+ years actually building or running HPC or large GPU clusters (on‑prem, cloud, or hybrid).
Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.
A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch.
Experience with storage and I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar systems.
Comfort with Terraform, Ansible, Helm, and GitOps‑style workflows.
Good scripting skills in Python or Bash.
You write and speak clearly, can lead a design review without losing the room, and can keep engineers and non‑technical stakeholders on the same page.
Legal authorization to work in the U.S. on a full-time basis without visa sponsorship.
Tech Stack
Ansible
Cloud
Docker
Kubernetes
Linux
NFS
Python
Terraform
Benefits
100% employer‑paid medical, dental, and vision for you and your family
4% 401(k) match with immediate vesting
Company‑paid short‑ and long‑term disability and life insurance
20 weeks paid parental leave for primary caregivers, 12 weeks for secondary
Support for your home office (mobile + internet stipend)