NVIDIA is seeking a Senior Infrastructure and Build Systems Engineer for its AI TensorRT-LLM team. The role involves taking ownership of the infrastructure and DevOps landscape, managing CI/CD pipelines, and collaborating with cross-functional teams to improve the efficiency and reliability of deep learning deployment.
Responsibilities:
- Build and maintain, from first principles, the infrastructure needed to deliver TensorRT-LLM
- Maintain CI/CD pipelines that automate the build, test, and deployment process, and build improvements that address their bottlenecks
- Manage tools and automate repetitive manual workflows using GitHub Actions, GitLab, Terraform, etc.
- Enable scanning for and handling of security CVEs in infrastructure components
- Improve the modularity of our build systems using CMake
- Use AI to help build automated triaging workflows
- Collaborate extensively with cross-functional teams to integrate pipelines from deep learning frameworks and components, ensuring seamless deployment and inference of deep learning models on our platform
Requirements:
- Master's degree or equivalent experience
- 3+ years of experience in computer science, computer architecture, or a related field
- Ability to work in a fast-paced, agile team environment
- Excellent Bash, Python, and CI/CD programming skills and software design skills, including debugging, performance analysis, and test design
- Experience with CMake
- Background in security best practices for releasing libraries
- Experience in administering, monitoring, and deploying systems and services on GitHub and cloud platforms
- Highly skilled in Kubernetes and Docker/containerd
- Automation expert with hands-on skills in frameworks like Ansible & Terraform
- Experience in AWS, Azure or GCP
- Experience contributing to a large open-source deep learning community: use of GitHub, bug tracking, branching and merging code, OSS licensing issues, handling patches, etc.
- Experience in defining and leading the DevOps strategy (design patterns, reliability and scaling) for a team or organization
- Experience driving efficiencies in software architecture, creating metrics, implementing infrastructure as code and other automation improvements
- Deep understanding of test automation infrastructure, frameworks, and test analysis
- Excellent problem-solving abilities spanning multiple areas of software (storage systems, kernels, and containers), and the ability to collaborate within an agile team to prioritize deep learning-specific features and capabilities within Triton Inference Server, using advanced troubleshooting and debugging techniques to resolve complex technical issues