NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior engineer to design and build factory infrastructure and automation for NVIDIA Inference Microservices (NIMs), focusing on creating scalable and reliable systems for AI model deployment.
Responsibilities:
- Develop a factory pipeline that takes an AI model as input and produces a deployable service validated across Cloud, On-prem and Kubernetes environments. With the team, define the group's technical strategies and roadmaps and deliver rapid iterations on them to improve the NIM factory. You will design interfaces, data models and schemas, and expand observability over the factory pipeline and its compute infrastructure.
- Work with technical leaders to design and develop scalable and reliable factory components. You will collaborate with multiple AI model teams to understand their requirements and build an efficient infrastructure that improves every team's productivity.
- Define metrics and drive improvements based on user feedback. You will mentor and collaborate within the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing your own skills and those of the people around you.
Requirements:
- A history of using your advanced programming skills to build distributed and compute systems, backend services, microservices and cloud technologies
- Experience working effectively with multi-functional teams, principals and architects across organizational boundaries
- Experience mentoring and growing teams and team members, and the flexibility to adjust your direction and expectations given the needs of our customers
- Deep technical expertise in distributed, containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus
- Passion for building rich microservice applications and their build and test automation pipelines
- Excellent interpersonal skills and the ability to lead multi-functional efforts
- Proven experience debugging and analyzing the performance of distributed microservices or cloud systems
- BS or MS in Computer Science, Computer Engineering or related field (or equivalent experience)
- 8+ years of demonstrated experience developing performant microservices, cloud software and/or tooling
- Experience delivering event-driven applications using various services such as Temporal, Kafka, Redis or others and a demonstrable ability to discuss the pros and cons of these choices
- A history of building and deploying containers for Microservices, Cloud and On-prem deployments, and their associated CI/CD pipelines
- Prior experience with large-scale, full-stack development