Moonlite AI delivers high-performance AI infrastructure for organizations engaged in intensive computational research and data processing. The company is seeking a Senior Site Reliability Engineer with deep Kubernetes expertise to build and operate production-grade AI infrastructure and ensure enterprise-grade reliability across multiple regions.

Responsibilities:

Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads
Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies
Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains
Deploy and optimize the NVIDIA GPU Operator, device plugins, and custom scheduling logic for GPU workload placement and utilization
Build deep integrations between Kubernetes and the underlying infrastructure, including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement
Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions
Manage production bare-metal infrastructure across multiple regions
Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments
Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack
Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR
Identify and resolve performance bottlenecks across infrastructure domains
Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads

Requirements:

5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale
Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters
Solid understanding of cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies
Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling
Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes
Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments
Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead
Solid understanding of networking concepts, including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production
Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems
Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems
Strong programming and scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency
Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages
Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers
Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure
