Penguin Solutions is seeking a Lead ML/AI Engineer to spearhead its post-sales delivery organization. In this role, you will lead a team of engineers to design and implement AI solutions, oversee production AI systems, and manage project delivery while ensuring effective communication with stakeholders.
Responsibilities:
- Lead the Technical Delivery Team: Lead, mentor, and provide technical guidance to a team of ML/AI Engineers and AI Infrastructure Specialists. You will be the ultimate owner of the technical quality, reliability, and performance of all deployed solutions
- Design and Implement MLOps Pipelines: Architect, design, and implement robust, automated CI/CD pipelines specifically for AI/ML models and applications. Your work will enable the rapid and reliable deployment of cutting-edge agentic AI solutions
- Oversee Production AI Systems: Own the operational strategy for our clients' AI environments, including the monitoring, scaling, maintenance, and security of production AI systems, ensuring they meet stringent enterprise-grade requirements
- Manage Project Delivery and Issues: Concurrently manage the technical execution of multiple customer-facing project deliveries. You will be the primary technical point of contact for issues that could impact project timeline, cost, scope, or effectiveness, driving them to resolution
- Drive Stakeholder Communication: Lead the presentation of project delivery status, performance metrics, and technical issue resolution plans to both internal Penguin Solutions audiences and customers, ensuring clear, transparent communication on all technical aspects of the project
Requirements:
- 7+ years of experience in software engineering, DevOps, or ML engineering, with at least 2 years in a technical leadership, mentorship, or lead engineer capacity
- Deep, hands-on experience building and managing CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions) and infrastructure-as-code (e.g., Ansible, Terraform, Puppet)
- Strong, production-level experience with containerization (Docker) and container orchestration (Kubernetes)
- Proficiency with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, ELK Stack)
- Excellent problem-solving and troubleshooting skills, especially in complex, distributed systems
- Specific experience with MLOps platforms and tools (e.g., Kubeflow, MLflow, Seldon Core)
- Hands-on experience with the NVIDIA AI Enterprise stack, particularly Triton Inference Server, TensorRT-LLM, and NeMo
- Experience in a customer-facing professional services or consulting role
- Strong scripting and programming skills, particularly in Python or Go
- Experience with deploying and managing infrastructure in both public cloud (AWS, Azure, GCP) and on-premises data center environments