Role Overview

Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
Maintain the existing CloudStack code base used in current production deployments.
Integrate new upstream CloudStack releases into the internal platform fork.
Perform upgrades of existing customer environments to newer CloudStack versions.
Design and execute safe upgrade paths for running production environments.
Troubleshoot orchestration and provisioning issues in existing deployments.
Maintain and troubleshoot CloudStack VPC networking.
Understand and work with CloudStack's Debian-based VPC routers.
Manage networking implementations based on Open vSwitch (OVS) and OVN.
Improve the reliability of network orchestration components.
Manage hypervisor implementations based on KVM and QEMU.
Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
Build orchestration workflows that expose GPU and CPU compute resources to platform users.
Implement Kubernetes scheduling strategies, including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
Design and implement workflow orchestration using Argo Workflows and develop reusable workflow templates for common platform workloads.

Requirements

Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
Strong experience with CloudStack internals, including extending and maintaining platform functionality.
Experience operating cloud orchestration platforms in production environments.
Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
Strong programming skills in Go and Python, with experience building cloud-native platform components.
Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
Familiarity with workflow orchestration systems such as Argo Workflows.
Familiarity with virtual networking and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
Able to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
Comfortable mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.

Tech Stack

Cloud
Java
Kubernetes
Node.js
Python
Go

Benefits

Attractive compensation package reflecting your expertise and experience.
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
You'll join a fast-growing scale-up with a mission to make a positive impact and real opportunities for career growth.

Senior AI Workload Platform Engineer
