Nscale is building next-generation AI infrastructure with a focus on GPU clusters for AI training and inference. This role leads the deployment of GPU clusters from installation through production readiness, validating them through hands-on execution and close collaboration with networking, systems software, and data center teams.
Responsibilities:
- Execute end-to-end bring-up of GPU nodes and racks from installation to production readiness
- Validate BIOS/BMC/firmware configurations and GPU health
- Perform rack-level integration including power, cabling, and airflow validation
- Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet)
- Configure and validate leaf/spine network connectivity
- Run cluster-wide burn-in and stress testing
- Validate GPU-to-GPU and node-to-node performance (NCCL, RDMA, GPUDirect)
- Troubleshoot hardware, firmware, and fabric-level issues
- Contribute to automation for provisioning and cluster validation
- Improve deployment playbooks and documentation
- Identify reliability issues early and drive corrective actions
- Help turn ad hoc deployments into repeatable systems
- Work closely with networking, systems software, and data center teams
- Coordinate with hardware vendors to resolve bring-up issues
- Support rapid capacity expansion as we scale
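The validation and automation responsibilities above often reduce to gating a node or fabric on measured performance. As a minimal sketch, the following checks NCCL all-reduce bus bandwidth parsed from nccl-tests output against a threshold; the column layout is assumed to match NVIDIA's `all_reduce_perf` result table, and the 180 GB/s figure is illustrative, not a real acceptance criterion from this posting.

```python
# Minimal sketch of a bandwidth gate for nccl-tests output.
# Assumptions (not from the posting): the log format matches the
# all_reduce_perf result table from NVIDIA's nccl-tests, and
# 180 GB/s is an illustrative threshold only.

def parse_busbw(log: str) -> list[float]:
    """Extract out-of-place busbw (GB/s) values from nccl-tests result rows."""
    values = []
    for line in log.splitlines():
        fields = line.split()
        # Result rows start with the message size in bytes; header
        # lines start with '#' and are skipped.
        if fields and fields[0].isdigit():
            # Columns: size, count, type, redop, root, time, algbw, busbw, ...
            values.append(float(fields[7]))
    return values

def passes_gate(log: str, min_busbw_gbps: float = 180.0) -> bool:
    """True if the largest-message busbw meets the threshold."""
    values = parse_busbw(log)
    return bool(values) and values[-1] >= min_busbw_gbps

sample = """\
#       size         count      type   redop    root     time   algbw   busbw  #wrong
  8589934592    2147483648     float     sum      -1    45000  190.89  357.9       0
"""
print(passes_gate(sample))  # True for the sample row above
```

In a cluster-wide burn-in, a gate like this would run after each `all_reduce_perf` sweep so that underperforming nodes or fabric links are flagged before the rack is handed to production.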
Requirements:
- 5–8+ years in infrastructure engineering, hardware deployment, or data center operations
- Hands-on experience deploying GPU servers (HGX/DGX or similar platforms)
- Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics)
- Strong Linux systems knowledge
- Experience troubleshooting distributed systems performance issues
- Comfortable working onsite in data center environments as needed
- Experience in AI/ML infrastructure or HPC environments
- Familiarity with NCCL, CUDA, RDMA
- Automation experience (Python, Ansible, Terraform, Bash)
- Experience in high-density power and cooling environments