Moonlite AI delivers high-performance AI infrastructure for organizations running intensive computational research and large-scale model training. The role involves building and operating production-grade AI infrastructure with a focus on Kubernetes, ensuring enterprise-grade reliability and operational excellence.
Responsibilities:
- Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads
- Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads
- Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains
- GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and custom scheduling logic to improve GPU workload placement and utilization
- Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement
- Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions
- Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments
- Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR
- Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads
Requirements:
- 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale
- Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies
- Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes
- Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments
- Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead
- Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production
- Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems
- Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems
- Strong programming and scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency
- Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages
- Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers
Preferred Qualifications:
- Experience building custom Kubernetes operators or controllers for infrastructure orchestration
- Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
- Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
- Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
- Experience with Kubernetes cluster federation or multi-cluster management
- Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
- Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
- Familiarity with configuration management at scale and GitOps practices
- Understanding of security best practices for Kubernetes and bare-metal infrastructure
- Experience operating infrastructure in regulated industries or co-located data center environments
- Background supporting research institutions, technical computing environments, or enterprise AI infrastructure