NVIDIA is a leader in groundbreaking developments in Artificial Intelligence and High-Performance Computing. They are seeking a Senior AI Infrastructure Software Engineer to join their DGX Cloud Lepton Team, focusing on building AI/ML platforms and scalable AI infrastructure services that enhance the productivity and efficiency of AI workloads.

Responsibilities:

Develop platforms and tools for large-scale AI, LLM, and GenAI infrastructure
Develop and optimize tools to improve AI/ML workload efficiency and resiliency
Root-cause, analyze, and triage failures from the application level to the hardware level
Enhance infrastructure and products underpinning NVIDIA's AI platforms
Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform
Define meaningful and actionable reliability metrics to track and improve system and service reliability
Apply strong problem-solving, root cause analysis, and optimization skills

Requirements:

8+ years of experience developing software infrastructure for large-scale AI systems
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience)
Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level
Proven track record in building and scaling large-scale distributed systems
Experience with AI training, inference, and data infrastructure services
Familiarity with Kubernetes and experience operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki)
Proficiency in programming languages such as Python and C/C++, as well as scripting languages
Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential
Experience working with large-scale AI clusters and cloud-native infrastructure
Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL)
Good understanding of DL framework internals (PyTorch, TensorFlow, JAX, Dynamo, and Ray)
Experience with root cause analysis of failures at datacenter scale
Strong background in software design and development
