Luma is building multimodal AI to expand human imagination and capabilities. The team is seeking a hands-on engineer to architect next-generation AI infrastructure and manage multi-cloud GPU clusters for training and inference.
Responsibilities:
- Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates
- Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
- Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment
- Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
- Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil
- Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA
Requirements:
- 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
- Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
- Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI
- Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect
- Startup DNA: You are energetic and thrive in a less structured, fast-paced environment
- Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
- Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs
- Deep expertise with GPU management tooling, such as NVIDIA's DCGM or AMD's ROCm stack
- Experience managing large-scale GPU clusters for AI/ML workloads (training or inference)
- Familiarity with Kubernetes-based job management systems or orchestration frameworks like Ray