Penguin Solutions is seeking a Lead ML/AI Engineer to spearhead its post-sales delivery organization. In this role, you will lead a team of engineers to design and implement AI solutions, oversee production AI systems, and manage project delivery while ensuring effective communication with stakeholders.
Responsibilities:
- Lead the Technical Delivery Team: Lead, mentor, and provide technical guidance to a team of ML/AI Engineers and AI Infrastructure Specialists. You will be the ultimate owner of the technical quality, reliability, and performance of all deployed solutions
- Design and Implement MLOps Pipelines: Architect, design, and implement robust, automated CI/CD pipelines specifically for AI/ML models and applications. Your work will enable the rapid and reliable deployment of cutting-edge agentic AI solutions
- Oversee Production AI Systems: Own the operational strategy for our clients' AI environments, including the monitoring, scaling, maintenance, and security of production AI systems, ensuring they meet stringent enterprise-grade requirements
- Manage Project Delivery and Issues: Concurrently manage the technical execution of multiple customer-facing project deliveries. You will be the primary technical point of contact for issues that could impact project timeline, cost, scope, or effectiveness, driving them to resolution
- Drive Stakeholder Communication: Lead the presentation of project delivery status, performance metrics, and technical issue resolution plans to both internal Penguin Solutions audiences and customers, ensuring clear, transparent communication on all technical aspects of the project
Requirements:
- 7+ years of experience in software engineering, DevOps, or ML engineering, with at least 2 years in a technical leadership, mentorship, or lead engineer capacity
- Deep, hands-on experience building and managing CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions) and infrastructure-as-code (e.g., Ansible, Terraform, Puppet)
- Strong, production-level experience with containerization (Docker) and container orchestration (Kubernetes)
- Proficiency with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, ELK Stack)
- Excellent problem-solving and troubleshooting skills, especially in complex, distributed systems
- Specific experience with MLOps platforms and tools (e.g., Kubeflow, MLflow, Seldon Core)
- Hands-on experience with the NVIDIA AI Enterprise stack, particularly Triton Inference Server, TensorRT-LLM, and NeMo
- Experience in a customer-facing professional services or consulting role
- Strong scripting and programming skills, particularly in Python or Go
- Experience with deploying and managing infrastructure in both public cloud (AWS, Azure, GCP) and on-premises data center environments