QumulusAI is seeking an experienced Site Reliability Engineer / DevOps Engineer to join our growing infrastructure team. In this role, you will be responsible for building, automating, and maintaining our bare-metal and cloud-native GPU infrastructure platform and keeping it reliable.
Responsibilities:
- Design, implement, and maintain CI/CD pipelines, infrastructure-as-code, and automated provisioning workflows for bare-metal and virtualized GPU environments
- Build and operate comprehensive observability stacks including metrics collection, log aggregation, distributed tracing, and alerting (Prometheus, Grafana, Loki, Thanos, or equivalent)
- Manage and optimize Linux-based server fleets (Ubuntu) including PXE boot provisioning, cloud-init configuration, and automated OS lifecycle management
- Develop and maintain automation tooling using Ansible, Terraform, and/or custom scripting (Python, Bash) for large-scale infrastructure operations
- Implement and manage container orchestration platforms (Kubernetes, vCluster) and virtualization layers (KubeVirt) in production environments
- Participate in on-call rotations, incident response, and post-incident reviews, and take on other engineering tasks required to maintain and advance the platform
- Provide outstanding customer service when engaging directly with customers, including technical support, onboarding assistance, and escalation resolution
- Collaborate with network engineering, infrastructure engineering, and product teams to deliver integrated platform capabilities
- Maintain and improve configuration management, secrets management, and compliance tooling across the platform
Requirements:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Deep proficiency with Linux systems administration (Ubuntu/Debian preferred), including kernel tuning, systemd, and package management
- Strong experience with at least two of the following: KVM, Kubernetes, Docker, Ansible, Terraform, Packer
- Hands-on experience building and operating telemetry/observability platforms (Prometheus, Grafana, ELK/Loki, Datadog, or similar)
- Proficiency in at least one scripting/programming language: Python, Go, or Bash
- Experience with CI/CD platforms (GitLab CI, GitHub Actions, Jenkins, or ArgoCD)
- Solid understanding of networking fundamentals: TCP/IP, DNS, HTTP, load balancing, and firewall rules
- Familiarity with bare-metal server provisioning, IPMI/BMC management, and RAID configuration
- Experience operating GPU infrastructure (NVIDIA drivers, CUDA toolkit, GPU monitoring)
- Knowledge of Ceph or other distributed storage systems
- Experience with SOC 2, HIPAA, ISO 27001, or SOX compliance, audit logging, or regulated environments
- Background in IaaS/hosting operations or managed cloud services
- Familiarity with BGP, VXLAN, or overlay networking concepts