Centific is a frontier AI data foundry that enables organizations to deploy AI safely and at scale. The MLOps Infrastructure Engineer will own the on-premises compute and GPU infrastructure, build and maintain Kubernetes clusters, and implement MLOps pipelines that keep the platform operating reliably in secure environments.
Responsibilities:
- Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes — including driver management, CUDA toolkit versioning, NVLink/NVSwitch topology, and firmware updates
- Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager) for health monitoring and telemetry, MIG (Multi-Instance GPU) partitioning for multi-tenant workloads, and the NVIDIA Container Toolkit for GPU-aware containerization (see the MIG sketch after this list)
- Manage bare-metal provisioning workflows (iPXE, PXE, or tools such as MAAS/Foreman) to enable repeatable, auditable server builds at client sites
- Monitor hardware health, capacity utilization, and thermal/power envelopes; define alerting thresholds and respond to hardware failures with minimal service disruption
- Build, upgrade, and maintain production-grade Kubernetes clusters (kubeadm or Rancher RKE2) on bare-metal infrastructure, with GPU node pools configured via the NVIDIA GPU Operator
- Design and operate cluster networking using CNI plugins appropriate for high-throughput AI workloads — Calico, Cilium, or SR-IOV for RDMA-capable networking where required
- Configure and manage MetalLB or equivalent bare-metal load balancing, ingress controllers, and service mesh components (Istio or Linkerd) for secure intra-cluster communication
- Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints to ensure AI training jobs, inference services, and platform workloads coexist without resource contention (see the quota and taint sketch after this list)
- Maintain cluster security posture: RBAC policies, Pod Security Admission, network policies, secrets management (HashiCorp Vault or Sealed Secrets), and CIS Kubernetes Benchmark compliance
- Deploy and operate MLOps platforms (MLflow, Kubeflow, or equivalent) for experiment tracking, model versioning, and pipeline orchestration across training and inference workloads (see the MLflow sketch after this list)
- Configure and manage NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and model ensemble execution on GPU nodes (see the Triton config sketch after this list)
- Build CI/CD pipelines for model deployment (GitOps with ArgoCD or Flux), including automated model validation, canary rollouts, and rollback mechanisms
- Optimize GPU utilization for both batch training jobs (scheduled via Volcano or Kueue) and latency-sensitive inference services, tracking efficiency metrics via DCGM and Prometheus (see the utilization query sketch after this list)
- Manage model artifact storage and versioning using software-defined storage backends (Ceph RBD/CephFS or MinIO) integrated with the MLOps toolchain (see the artifact upload sketch after this list)
- Design and implement the high-bandwidth network fabric required for GPU cluster interconnects — InfiniBand, RoCE v2, or high-speed Ethernet — and ensure RDMA is correctly configured for distributed training workloads
- Deploy and operate software-defined storage solutions (Ceph or equivalent) providing block, object, and file storage tiers for training datasets, model checkpoints, and platform telemetry
- Configure network segmentation, VLANs, and firewall policies to meet NIST SP 800-171 requirements in on-premises and air-gapped environments; document network topology for client system security plans
- Establish and maintain VPN or secure tunneling solutions for hybrid connectivity between edge nodes, on-premises clusters, and any permitted cloud services
- Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements: access control, audit logging, configuration management, incident response readiness, and media protection
- Maintain hardened OS baselines (RHEL/Rocky STIG or Ubuntu CIS benchmarks) across all infrastructure nodes; automate compliance scanning with OpenSCAP or equivalent (see the OpenSCAP sketch after this list)
- Produce and maintain infrastructure documentation required for government procurement: network diagrams, hardware inventories, system security plan (SSP) contributions, and disaster recovery runbooks
- Support penetration testing engagements by providing accurate infrastructure context and remediating findings within agreed timelines
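The sketches below make a few of these responsibilities concrete. Each is a minimal example under stated assumptions, not a description of Centific's actual environment; every hostname, path, credential, and resource name in them is illustrative. First, the MIG partitioning workflow: enabling MIG mode on one GPU and carving it into two instances. Profile ID 9 (the 3g.20gb slice on an A100-40GB) is an assumption to verify with `nvidia-smi mig -lgip` on the target hardware.

```python
"""Sketch: carve a MIG-capable A100 into two instances with nvidia-smi.

Assumes root privileges, nvidia-smi on PATH, and an idle GPU. Profile ID 9
(3g.20gb on an A100-40GB) is illustrative; valid IDs vary by GPU model.
"""
import subprocess

def run(cmd: list[str]) -> str:
    """Run a command, raise on non-zero exit, and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Enable MIG mode on GPU 0 (some systems need a GPU reset before it applies).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# Create two 3g.20gb GPU instances; -C also creates default compute instances.
run(["nvidia-smi", "mig", "-i", "0", "-cgi", "9,9", "-C"])

# List the resulting GPU instances to verify the partition layout.
print(run(["nvidia-smi", "mig", "-lgi"]))
```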
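Next, workload isolation on shared GPU nodes: a sketch using the official kubernetes Python client to cap a team namespace's GPU requests and taint a GPU node so only tolerating pods schedule there. The namespace and node names are assumptions.

```python
"""Sketch: GPU quota per namespace plus a node taint for GPU isolation.

Assumes a cluster reachable via ~/.kube/config, an existing "ml-training"
namespace, and a node named "gpu-node-01" (all illustrative).
"""
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Cap the namespace at 8 requested GPUs so one team cannot starve the rest.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="gpu-quota"),
    spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "8"}),
)
core.create_namespaced_resource_quota(namespace="ml-training", body=quota)

# Taint the GPU node; only pods with a matching toleration will land on it.
core.patch_node(
    "gpu-node-01",
    {"spec": {"taints": [
        {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"},
    ]}},
)
```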
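For experiment tracking, a minimal MLflow sketch; the tracking URI, experiment name, and logged parameters and metrics are all placeholders.

```python
"""Sketch: log a training run to a self-hosted MLflow tracking server.

Assumes `pip install mlflow` and a tracking server at the URI below
(an assumed in-cluster address) backed by the platform artifact store.
"""
import mlflow

mlflow.set_tracking_uri("http://mlflow.platform.svc:5000")  # assumed address
mlflow.set_experiment("llm-finetune-smoke-test")            # illustrative name

with mlflow.start_run():
    mlflow.log_param("base_model", "llama-3-8b")   # placeholder values
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_metric("eval_loss", 1.87, step=100)
```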
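For Triton, a sketch that generates a model configuration enabling dynamic batching and two GPU instances; the model name, backend, and repository layout are assumptions.

```python
"""Sketch: write a Triton config.pbtxt enabling dynamic batching.

Assumes an ONNX model will live at models/resnet50/1/model.onnx
(illustrative layout); field names follow Triton's model-config schema.
"""
from pathlib import Path

CONFIG = """\
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 500
}
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
"""

model_dir = Path("models/resnet50")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(CONFIG)
```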
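For utilization tracking, a sketch that queries Prometheus for the DCGM exporter's GPU-utilization gauge and prints a per-host average; the Prometheus address is an assumed in-cluster service name.

```python
"""Sketch: per-host average GPU utilization from dcgm-exporter metrics.

Assumes a Prometheus server scraping dcgm-exporter; DCGM_FI_DEV_GPU_UTIL
is the exporter's default utilization gauge, labeled by Hostname.
"""
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed address

query = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

# Instant-query results come back as a vector of labeled samples.
for series in resp.json()["data"]["result"]:
    host = series["metric"].get("Hostname", "unknown")
    util = float(series["value"][1])
    print(f"{host}: {util:.1f}% average GPU utilization")
```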
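For artifact storage, a sketch that uploads a versioned model checkpoint to MinIO over the S3 API; the endpoint, credentials, bucket, and key layout are placeholders.

```python
"""Sketch: push a versioned checkpoint to MinIO via the S3 API.

Assumes `pip install boto3`, a MinIO endpoint at the address below,
and a pre-created "model-artifacts" bucket (all illustrative).
"""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.platform.svc:9000",  # assumed address
    aws_access_key_id="EXAMPLE-ACCESS-KEY",         # placeholder credentials
    aws_secret_access_key="EXAMPLE-SECRET-KEY",
)

# Key layout encodes model name and version so the MLOps tooling can resolve it.
s3.upload_file(
    Filename="checkpoints/step-1000.safetensors",
    Bucket="model-artifacts",
    Key="llama-3-8b-finetune/v3/step-1000.safetensors",
)
```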
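Finally, automated compliance scanning: a sketch that runs oscap against a STIG profile. The datastream path and profile ID reflect Rocky Linux 9's scap-security-guide packaging and should be confirmed per OS release with `oscap info`.

```python
"""Sketch: evaluate a host against a STIG profile with OpenSCAP.

Assumes openscap-scanner and scap-security-guide are installed; the
datastream path and profile ID below are Rocky Linux 9 assumptions.
"""
import subprocess

DATASTREAM = "/usr/share/xml/scap/ssg/content/ssg-rl9-ds.xml"  # assumed path
PROFILE = "xccdf_org.ssgproject.content_profile_stig"

# oscap exits non-zero when rules fail, so a failed scan is data, not a crash.
result = subprocess.run(
    ["oscap", "xccdf", "eval",
     "--profile", PROFILE,
     "--results", "oscap-results.xml",
     "--report", "oscap-report.html",
     DATASTREAM],
    capture_output=True, text=True,
)
print(result.stdout)
```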
Requirements:
- 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production
- Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes
- Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations
- Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management
- Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery
- Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines
- Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence
- Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds
- Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management
- Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment
- Experience with air-gapped or classified network environments and the operational discipline they require (offline package mirrors, USB-controlled media transfers, etc.)
Preferred qualifications:
- Familiarity with CMMC Level 2/3 assessment processes and evidence collection
- Experience with NVIDIA DGX systems, DGX BasePOD reference architectures, or the NVIDIA AI Enterprise software stack
- Knowledge of distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM) and their infrastructure requirements — useful for supporting AI/ML engineering teammates
- Experience deploying Kubernetes at the edge: K3s, MicroK8s, or NVIDIA Jetson-based edge clusters
- Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, and DCGM Exporter for GPU telemetry dashboards
- US Person status or active security clearance — advantageous for certain client site engagements
- Background in SCADA, ICS, or OT network environments relevant to critical infrastructure clients