Lead customer architecture & design, translating HPC/AI workload requirements into scalable cluster architectures
Deploy and operationalize clusters using Omnia or similar automation
Build and maintain provisioning workflows (OpenCHAMI-based or equivalent)
Serve as Tier-3 engineering escalation, troubleshooting complex provisioning, scheduling, GPU, networking, and performance issues
Contribute to open source and customer enablement through code contributions, documentation, workshops, runbooks, templates, and field readiness materials
Requirements
8+ years of experience engineering large-scale HPC and distributed infrastructure
Strong knowledge of cluster architecture, schedulers, and provisioning workflows
Deep experience with RHEL, Rocky Linux, and Ubuntu
Hands-on experience deploying clusters with open-source toolchains, including Omnia and OpenCHAMI
Production experience with Slurm and/or Kubernetes
Proficient with Docker/Podman, OpenTelemetry pipelines, and telemetry instrumentation