NVIDIA has been transforming computer graphics and AI for over 30 years and is seeking a Senior Systems Software Engineer for AI Infrastructure. The role involves developing and maintaining large-scale systems for AI model training, collaborating on tooling for HPC and GPU training, and applying SRE fundamentals to improve system performance and reliability.
Responsibilities:
- Develop and maintain large-scale AI Infrastructure systems supporting critical use cases, including frontier model training, while driving reliability, operability, and scalability across global public and private clouds
- Collaborate on tooling for HPC, GPU training, and AI model training workflows
- Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution, driving continuous improvement in system performance
- Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems to improve team efficiency and system resilience
- Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead
- Work with engineering teams to deliver innovative solutions, uphold high standards for code and infrastructure, and contribute to hiring for a diverse, high-performing team
Requirements:
- Degree in Computer Science or a related field, or equivalent experience, with 5+ years in Software Development, SRE, or Production Engineering
- Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby)
- Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI)
- Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, and experience with Infrastructure as Code tools (e.g., Terraform CDK)
- Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab)
- Strong communication skills with the ability to convey technical concepts effectively to diverse audiences
- Commitment to fostering a culture of diversity, curiosity, and continuous improvement
- Experience in AI training, inferencing, and data infrastructure services
- Proficiency in deep learning and distributed computing frameworks such as PyTorch, TensorFlow, JAX, and Ray
- A strong background in cloud or hardware health monitoring and system reliability
- Hands-on expertise in operating and scaling distributed systems with stringent SLAs, ensuring high availability and performance
- Knowledge of incident, change, and problem management processes, fostering continuous improvement in sophisticated environments