NVIDIA is a leader in groundbreaking developments in Artificial Intelligence and High-Performance Computing. They are seeking a Senior AI Infrastructure Software Engineer to join their DGX Cloud Lepton Team, focusing on building AI/ML platforms and scalable AI infrastructure services that enhance the productivity and efficiency of AI workloads.
Responsibilities:
- Develop platforms and tools for large-scale AI, LLM, and GenAI infrastructure
- Develop and optimize tools to improve AI/ML workload efficiency and resiliency
- Root-cause, analyze, and triage failures from the application level to the hardware level
- Enhance infrastructure and products underpinning NVIDIA's AI platforms
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform
- Define meaningful and actionable reliability metrics to track and improve system and service reliability
- Apply problem-solving, root cause analysis, and optimization skills
Requirements:
- 8+ years of experience developing software infrastructure for large-scale AI systems
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience)
- Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level
- Proven track record in building and scaling large-scale distributed systems
- Experience with AI training, inference, and data infrastructure services
- Familiarity with Kubernetes and with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki)
- Proficiency in programming languages such as Python and C/C++, as well as scripting languages
- Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential
- Experience working with large-scale AI clusters and cloud-native infrastructure
- Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL)
- Good understanding of the internals of DL frameworks (PyTorch, TensorFlow, JAX, Dynamo, and Ray)
- Experience with root cause analysis of failures at datacenter scale
- Strong background in software design and development