QumulusAI is seeking an experienced Site Reliability Engineer / DevOps Engineer to join our growing infrastructure team. In this role, you will be responsible for building and automating our bare-metal and cloud-native GPU infrastructure platform and for maintaining its reliability.

Responsibilities:

Design, implement, and maintain CI/CD pipelines, infrastructure-as-code, and automated provisioning workflows for bare-metal and virtualized GPU environments
Build and operate comprehensive observability stacks including metrics collection, log aggregation, distributed tracing, and alerting (Prometheus, Grafana, Loki, Thanos, or equivalent)
Manage and optimize Linux-based server fleets (Ubuntu) including PXE boot provisioning, cloud-init configuration, and automated OS lifecycle management
Develop and maintain automation tooling using Ansible, Terraform, and/or custom scripting (Python, Bash) for large-scale infrastructure operations
Implement and manage container orchestration platforms (Kubernetes, vCluster) and virtualization layers (KubeVirt) in production environments
Participate in on-call rotations, incident response, and post-incident reviews, and take on other engineering tasks required to maintain and advance the platform
Provide outstanding customer service when engaging directly with customers, including technical support, onboarding assistance, and escalation resolution
Collaborate with network engineering, infrastructure engineering, and product teams to deliver integrated platform capabilities
Maintain and improve configuration management, secrets management, and compliance tooling across the platform

Requirements:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Deep proficiency with Linux systems administration (Ubuntu/Debian preferred), including kernel tuning, systemd, and package management
Strong experience with at least two of the following: KVM, Kubernetes, Docker, Ansible, Terraform, or Packer
Hands-on experience building and operating telemetry/observability platforms (Prometheus, Grafana, ELK/Loki, Datadog, or similar)
Proficiency in at least one scripting/programming language: Python, Go, or Bash
Experience with CI/CD platforms (GitLab CI, GitHub Actions, Jenkins, or ArgoCD)
Solid understanding of networking fundamentals: TCP/IP, DNS, HTTP, load balancing, and firewall rules
Familiarity with bare-metal server provisioning, IPMI/BMC management, and RAID configuration
Experience operating GPU infrastructure (NVIDIA drivers, CUDA toolkit, GPU monitoring)
Knowledge of Ceph or other distributed storage systems
Experience with SOC 2, HIPAA, ISO 27001, SOX compliance, audit logging, or regulated environments
Background in IaaS/hosting operations or managed cloud services
Familiarity with BGP, VXLAN, or overlay networking concepts
