Moonlite AI delivers high-performance AI infrastructure for organizations involved in intensive computational research and data processing. The company is seeking a Senior Site Reliability Engineer with deep Kubernetes expertise to build and operate production-grade AI infrastructure and ensure enterprise-grade reliability across multiple regions.
Responsibilities:
- Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads
- Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies
- Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains
- Deploy and optimize the NVIDIA GPU Operator, device plugins, and custom scheduling logic to improve GPU workload placement and utilization
- Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement
- Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions
- Manage production bare-metal infrastructure across multiple regions
- Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments
- Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack
- Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR
- Identify and resolve performance bottlenecks across infrastructure domains
- Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads
Requirements:
- 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale
- Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters
- Working understanding of cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies
- Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling
- Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes
- Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments
- Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead
- Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production
- Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems
- Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems
- Strong programming and scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency
- Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages
- Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers
- Experience building custom Kubernetes operators or controllers for infrastructure orchestration
- Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
- Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
- Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
- Experience with Kubernetes cluster federation or multi-cluster management
- Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
- Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
- Familiarity with configuration management at scale and GitOps practices
- Understanding of security best practices for Kubernetes and bare-metal infrastructure
- Experience operating infrastructure in regulated industries or co-located data center environments
- Background supporting research institutions, technical computing environments, or enterprise AI infrastructure