NVIDIA is a leading technology company known for its innovation in AI and computing. It is seeking a New College Grad for the role of Infrastructure and Build Systems Engineer, responsible for managing the infrastructure and DevOps landscape, including CI/CD pipelines and build systems, while collaborating with cross-functional teams to ensure seamless deployment of deep learning models.
Responsibilities:
- Build and maintain, from first principles, the infrastructure needed to deliver TensorRT LLM
- Maintain CI/CD pipelines that automate the build, test, and deployment process, and resolve their bottlenecks. Manage tools and automate redundant manual workflows via GitHub Actions, GitLab, Terraform, etc.
- Enable scanning for and handling of security CVEs in infrastructure components
- Improve the modularity of our build systems using CMake
- Use AI to help build automated triaging workflows
- Collaborate extensively with cross-functional teams to integrate pipelines from deep learning frameworks and components, ensuring seamless deployment and inference of deep learning models on our platform
Requirements:
- Master's degree or equivalent experience
- Experience in computer science, computer architecture, or a related field
- Ability to work in a fast-paced, agile team environment
- Excellent Bash, CI/CD, Python programming and software design skills, including debugging, performance analysis, and test design
- Experience with CMake
- Background in security best practices for releasing libraries
- Experience in administering, monitoring, and deploying systems and services on GitHub and cloud platforms
- Highly skilled in Kubernetes and Docker/containerd
- Automation expert with hands-on skills in frameworks like Ansible & Terraform
- Experience in AWS, Azure or GCP
- Experience contributing to a large open-source deep learning community: use of GitHub, bug tracking, branching and merging code, OSS licensing issues, handling patches, etc.
- Experience in defining and leading the DevOps strategy (design patterns, reliability and scaling) for a team or organization
- Experience driving efficiencies in software architecture, creating metrics, implementing infrastructure as code and other automation improvements
- Deep understanding of test automation infrastructure, frameworks, and test analysis
- Excellent problem-solving abilities spanning multiple software components (storage systems, kernels, and containers), as well as the ability to collaborate within an agile team environment to prioritize deep learning-specific features and capabilities within Triton Inference Server, employing advanced troubleshooting and debugging techniques to resolve complex technical issues