Centific is a frontier AI data foundry that enables organizations to deploy AI safely and at scale. The MLOps Infrastructure Engineer will own the on-premises compute and GPU infrastructure, build and maintain Kubernetes clusters, and implement MLOps pipelines that keep the platform operating reliably in secure environments.
Responsibilities:
- Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes — including driver management, CUDA toolkit versioning, NVLink/NVSwitch topology, and firmware updates
- Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager) for health monitoring and telemetry, MIG (Multi-Instance GPU) partitioning for multi-tenant workloads, and the NVIDIA Container Toolkit for GPU-aware containerization (see the MIG sketch after this list)
- Manage bare-metal provisioning workflows (iPXE, PXE, or tools such as MAAS/Foreman) to enable repeatable, auditable server builds at client sites
- Monitor hardware health, capacity utilization, and thermal/power envelopes; define alerting thresholds and respond to hardware failures with minimal service disruption
- Build, upgrade, and maintain production-grade Kubernetes clusters (kubeadm or Rancher RKE2) on bare-metal infrastructure, with GPU node pools configured via the NVIDIA GPU Operator
- Design and operate cluster networking using CNI plugins appropriate for high-throughput AI workloads — Calico, Cilium, or SR-IOV for RDMA-capable networking where required
- Configure and manage MetalLB or equivalent bare-metal load balancing, ingress controllers, and service mesh components (Istio or Linkerd) for secure intra-cluster communication
- Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints to ensure AI training jobs, inference services, and platform workloads coexist without resource contention (see the quota and taint sketch after this list)
- Maintain cluster security posture: RBAC policies, Pod Security Admission, network policies, secrets management (HashiCorp Vault or Sealed Secrets), and CIS Kubernetes Benchmark compliance
- Deploy and operate MLOps platforms (MLflow, Kubeflow, or equivalent) for experiment tracking, model versioning, and pipeline orchestration across training and inference workloads (see the MLflow sketch after this list)
- Configure and manage NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and model ensemble execution on GPU nodes (see the Triton config sketch after this list)
- Build CI/CD pipelines for model deployment (GitOps with ArgoCD or Flux), including automated model validation, canary rollouts, and rollback mechanisms
- Optimize GPU utilization for both batch training jobs (scheduled via Volcano or Kueue) and latency-sensitive inference services, tracking efficiency metrics via DCGM and Prometheus (see the utilization query sketch after this list)
- Manage model artifact storage and versioning using software-defined storage backends (Ceph RBD/CephFS or MinIO) integrated with the MLOps toolchain (see the artifact upload sketch after this list)
- Design and implement the high-bandwidth network fabric required for GPU cluster interconnects — InfiniBand, RoCE v2, or high-speed Ethernet — and ensure RDMA is correctly configured for distributed training workloads
- Deploy and operate software-defined storage solutions (Ceph or equivalent) providing block, object, and file storage tiers for training datasets, model checkpoints, and platform telemetry
- Configure network segmentation, VLANs, and firewall policies to meet NIST SP 800-171 requirements in on-premises and air-gapped environments; document network topology for client system security plans
- Establish and maintain VPN or secure tunneling solutions for hybrid connectivity between edge nodes, on-premises clusters, and any permitted cloud services
- Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements: access control, audit logging, configuration management, incident response readiness, and media protection
- Maintain hardened OS baselines (RHEL/Rocky STIG or Ubuntu CIS benchmarks) across all infrastructure nodes; automate compliance scanning with OpenSCAP or equivalent (see the OpenSCAP sketch after this list)
- Produce and maintain infrastructure documentation required for government procurement: network diagrams, hardware inventories, system security plan (SSP) contributions, and disaster recovery runbooks
- Support penetration testing engagements by providing accurate infrastructure context and remediating findings within agreed timelines
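The sketches below make a few of these responsibilities concrete. Each is a minimal example under stated assumptions, not a description of Centific's actual environment; every hostname, path, credential, and resource name in them is illustrative. First, the MIG partitioning workflow: enabling MIG mode on one GPU and carving it into two instances. Profile ID 9 (the 3g.20gb slice on an A100-40GB) is an assumption to verify with `nvidia-smi mig -lgip` on the target hardware.

```python
"""Sketch: carve a MIG-capable A100 into two instances with nvidia-smi.

Assumes root privileges, nvidia-smi on PATH, and an idle GPU. Profile ID 9
(3g.20gb on an A100-40GB) is illustrative; valid IDs vary by GPU model.
"""
import subprocess

def run(cmd: list[str]) -> str:
    """Run a command, raise on non-zero exit, and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Enable MIG mode on GPU 0 (some systems need a GPU reset before it applies).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# Create two 3g.20gb GPU instances; -C also creates default compute instances.
run(["nvidia-smi", "mig", "-i", "0", "-cgi", "9,9", "-C"])

# List the resulting GPU instances to verify the partition layout.
print(run(["nvidia-smi", "mig", "-lgi"]))
```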
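Next, workload isolation on shared GPU nodes: a sketch using the official kubernetes Python client to cap a team namespace's GPU requests and taint a GPU node so only tolerating pods schedule there. The namespace and node names are assumptions.

```python
"""Sketch: GPU quota per namespace plus a node taint for GPU isolation.

Assumes a cluster reachable via ~/.kube/config, an existing "ml-training"
namespace, and a node named "gpu-node-01" (all illustrative).
"""
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Cap the namespace at 8 requested GPUs so one team cannot starve the rest.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="gpu-quota"),
    spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "8"}),
)
core.create_namespaced_resource_quota(namespace="ml-training", body=quota)

# Taint the GPU node; only pods with a matching toleration will land on it.
core.patch_node(
    "gpu-node-01",
    {"spec": {"taints": [
        {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"},
    ]}},
)
```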
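For experiment tracking, a minimal MLflow sketch; the tracking URI, experiment name, and logged parameters and metrics are all placeholders.

```python
"""Sketch: log a training run to a self-hosted MLflow tracking server.

Assumes `pip install mlflow` and a tracking server at the URI below
(an assumed in-cluster address) backed by the platform artifact store.
"""
import mlflow

mlflow.set_tracking_uri("http://mlflow.platform.svc:5000")  # assumed address
mlflow.set_experiment("llm-finetune-smoke-test")            # illustrative name

with mlflow.start_run():
    mlflow.log_param("base_model", "llama-3-8b")   # placeholder values
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_metric("eval_loss", 1.87, step=100)
```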
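For Triton, a sketch that generates a model configuration enabling dynamic batching and two GPU instances; the model name, backend, and repository layout are assumptions.

```python
"""Sketch: write a Triton config.pbtxt enabling dynamic batching.

Assumes an ONNX model will live at models/resnet50/1/model.onnx
(illustrative layout); field names follow Triton's model-config schema.
"""
from pathlib import Path

CONFIG = """\
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 500
}
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
"""

model_dir = Path("models/resnet50")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(CONFIG)
```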
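For utilization tracking, a sketch that queries Prometheus for the DCGM exporter's GPU-utilization gauge and prints a per-host average; the Prometheus address is an assumed in-cluster service name.

```python
"""Sketch: per-host average GPU utilization from dcgm-exporter metrics.

Assumes a Prometheus server scraping dcgm-exporter; DCGM_FI_DEV_GPU_UTIL
is the exporter's default utilization gauge, labeled by Hostname.
"""
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed address

query = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

# Instant-query results come back as a vector of labeled samples.
for series in resp.json()["data"]["result"]:
    host = series["metric"].get("Hostname", "unknown")
    util = float(series["value"][1])
    print(f"{host}: {util:.1f}% average GPU utilization")
```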
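For artifact storage, a sketch that uploads a versioned model checkpoint to MinIO over the S3 API; the endpoint, credentials, bucket, and key layout are placeholders.

```python
"""Sketch: push a versioned checkpoint to MinIO via the S3 API.

Assumes `pip install boto3`, a MinIO endpoint at the address below,
and a pre-created "model-artifacts" bucket (all illustrative).
"""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.platform.svc:9000",  # assumed address
    aws_access_key_id="EXAMPLE-ACCESS-KEY",         # placeholder credentials
    aws_secret_access_key="EXAMPLE-SECRET-KEY",
)

# Key layout encodes model name and version so the MLOps tooling can resolve it.
s3.upload_file(
    Filename="checkpoints/step-1000.safetensors",
    Bucket="model-artifacts",
    Key="llama-3-8b-finetune/v3/step-1000.safetensors",
)
```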
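Finally, automated compliance scanning: a sketch that runs oscap against a STIG profile. The datastream path and profile ID reflect Rocky Linux 9's scap-security-guide packaging and should be confirmed per OS release with `oscap info`.

```python
"""Sketch: evaluate a host against a STIG profile with OpenSCAP.

Assumes openscap-scanner and scap-security-guide are installed; the
datastream path and profile ID below are Rocky Linux 9 assumptions.
"""
import subprocess

DATASTREAM = "/usr/share/xml/scap/ssg/content/ssg-rl9-ds.xml"  # assumed path
PROFILE = "xccdf_org.ssgproject.content_profile_stig"

# oscap exits non-zero when rules fail, so a failed scan is data, not a crash.
result = subprocess.run(
    ["oscap", "xccdf", "eval",
     "--profile", PROFILE,
     "--results", "oscap-results.xml",
     "--report", "oscap-report.html",
     DATASTREAM],
    capture_output=True, text=True,
)
print(result.stdout)
```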
Requirements:
- 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production
- Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes
- Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations
- Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management
- Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery
- Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines
- Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence
- Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds
- Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management
- Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment
- Experience with air-gapped or classified network environments and the operational discipline they require (offline package mirrors, USB-controlled media transfers, etc.)
Preferred qualifications:
- Familiarity with CMMC Level 2/3 assessment processes and evidence collection
- Experience with NVIDIA DGX systems, DGX BasePOD reference architectures, or the NVIDIA AI Enterprise software stack
- Knowledge of distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM) and their infrastructure requirements — useful for supporting AI/ML engineering teammates
- Experience deploying Kubernetes at the edge: K3s, MicroK8s, or NVIDIA Jetson-based edge clusters
- Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, and DCGM Exporter for GPU telemetry dashboards
- US Person status or active security clearance — advantageous for certain client site engagements
- Background in SCADA, ICS, or OT network environments relevant to critical infrastructure clients