Nscale is building next-generation AI infrastructure with a focus on GPU clusters for AI training and inference. This role leads the deployment of GPU clusters from installation through production readiness, validating them through hands-on execution and close collaboration with networking, systems software, and data center teams.
Responsibilities:
- Execute end-to-end bring-up of GPU nodes and racks from installation to production readiness
- Validate BIOS/BMC/firmware configurations and GPU health
- Perform rack-level integration including power, cabling, and airflow validation
- Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet)
- Configure and validate leaf/spine network connectivity
- Run cluster-wide burn-in and stress testing
- Validate GPU-to-GPU and node-to-node performance (NCCL, RDMA, GPUDirect)
- Troubleshoot hardware, firmware, and fabric-level issues
- Contribute to automation for provisioning and cluster validation
- Improve deployment playbooks and documentation
- Identify reliability issues early and drive corrective actions
- Help turn ad hoc deployments into repeatable systems
- Work closely with networking, systems software, and data center teams
- Coordinate with hardware vendors to resolve bring-up issues
- Support rapid capacity expansion as we scale
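The validation and automation responsibilities above often reduce to gating a node or fabric on measured performance. As a minimal sketch, the following checks NCCL all-reduce bus bandwidth parsed from nccl-tests output against a threshold; the column layout is assumed to match NVIDIA's `all_reduce_perf` result table, and the 180 GB/s figure is illustrative, not a real acceptance criterion from this posting.

```python
# Minimal sketch of a bandwidth gate for nccl-tests output.
# Assumptions (not from the posting): the log format matches the
# all_reduce_perf result table from NVIDIA's nccl-tests, and
# 180 GB/s is an illustrative threshold only.

def parse_busbw(log: str) -> list[float]:
    """Extract out-of-place busbw (GB/s) values from nccl-tests result rows."""
    values = []
    for line in log.splitlines():
        fields = line.split()
        # Result rows start with the message size in bytes; header
        # lines start with '#' and are skipped.
        if fields and fields[0].isdigit():
            # Columns: size, count, type, redop, root, time, algbw, busbw, ...
            values.append(float(fields[7]))
    return values

def passes_gate(log: str, min_busbw_gbps: float = 180.0) -> bool:
    """True if the largest-message busbw meets the threshold."""
    values = parse_busbw(log)
    return bool(values) and values[-1] >= min_busbw_gbps

sample = """\
#       size         count      type   redop    root     time   algbw   busbw  #wrong
  8589934592    2147483648     float     sum      -1    45000  190.89  357.9       0
"""
print(passes_gate(sample))  # True for the sample row above
```

In a cluster-wide burn-in, a gate like this would run after each `all_reduce_perf` sweep so that underperforming nodes or fabric links are flagged before the rack is handed to production.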
Requirements:
- 5–8+ years in infrastructure engineering, hardware deployment, or data center operations
- Hands-on experience deploying GPU servers (HGX/DGX or similar platforms)
- Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics)
- Strong Linux systems knowledge
- Experience troubleshooting distributed systems performance issues
- Comfortable working onsite in data center environments as needed
- Experience in AI/ML infrastructure or HPC environments
- Familiarity with NCCL, CUDA, RDMA
- Automation experience (Python, Ansible, Terraform, Bash)
- Experience in high-density power and cooling environments