MasterClass is the streaming platform where the world’s best come together to share their knowledge and stories. They are seeking a Staff Infrastructure Engineer who will act as a technical leader, designing scalable systems and enhancing infrastructure across the company. This role involves leading cloud infrastructure initiatives, partnering with various teams, and providing technical direction for infrastructure systems.
Responsibilities:
- Set technical direction for core infrastructure systems, balancing reliability, scalability, security, and developer velocity
- Design and lead implementation of complex, cross-team infrastructure initiatives (AWS, EKS, CI/CD, shared platforms)
- Build production-grade software to power infrastructure automation, tooling, and internal platforms
- Drive org-wide improvements in observability, incident response, reliability engineering, and cost efficiency
- Lead video infrastructure initiatives to evolve video delivery and CDN strategies for global scale and performance
- Partner with AI and Data teams to define and build GPU-powered infrastructure, model-serving platforms, and GenAI inference/training workflows
- Act as a technical mentor and reviewer, raising the bar for infrastructure design and operational excellence across teams
- Lead by example during incidents, postmortems, and critical architectural decisions
Requirements:
- 10+ years building and operating cloud-native production systems
- Strong expertise in:
  - AWS architecture and large-scale cloud operations
  - Kubernetes / EKS (networking, RBAC, Helm, autoscaling, cluster-level debugging)
  - Infrastructure as Code (Terraform preferred)
  - CI/CD and automation (GitHub Actions, Argo, or similar)
- Proven ability to design and build infrastructure software, not just operate systems
- Deep understanding of distributed systems, networking, Linux internals, and cloud security
- A track record of driving step-function improvements in reliability, scalability, or developer experience
- Strong written and verbal communication skills; able to influence technical decisions across teams
Nice to have:
- AI/ML infrastructure, GPU scheduling, or distributed training
- Model serving frameworks (e.g., Triton, vLLM, TensorRT-LLM)
- Infrastructure autoscaling strategies, including application workloads, databases, and GenAI workloads
- KEDA, Karpenter, service mesh, or multi-cluster architectures
- HLS/DASH packaging or video encoding pipelines