Advise on and help maintain large-scale computational and AI infrastructure
Provide consultative guidance and perform hands-on problem-solving across the full stack
Assess customer environments and recommend optimized, production-ready Kubernetes-based container platforms
Serve as a key technical resource: develop, refine, and document standard methodologies and operational guidelines
Support Research & Development activities and engage in POCs/POVs to validate new features
Create and deliver high-quality documentation, including runbooks, onboarding materials, and best-practice guides
Act as the technical leader for assigned customer accounts, providing strategic guidance on DevOps and platform architecture
Requirements
BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields (or equivalent experience)
8+ years of professional experience in scalable cloud environments and automation engineering roles
Demonstrated understanding of networking fundamentals and data center architectures, with hands-on experience managing HPC/AI clusters
Proven hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure
Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with GPU-accelerated and HPC environments
Strong familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks
Deep knowledge of Linux (Red Hat, Ubuntu), OS-level security, and networking protocols
Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools
Experience with observability stacks (Grafana, Loki, Prometheus)
Strong background in designing scalable solutions and providing consultative support to customers