Oxmiq Labs is a company focused on designing GPU and AI silicon for large-scale model inference and training. The Architect, AI Cloud Platform, is responsible for the inference-serving architecture, ensuring efficient model loading, scheduling, and routing across heterogeneous hardware to optimize performance and cost.

Responsibilities:

Own the inference-serving architecture end to end, including model loading, continuous batching, KV-cache management, prefix caching, request routing, and SLA-aware scheduling across heterogeneous accelerators
Lead the design of disaggregated prefill/decode deployments, including KV-cache transfer (e.g., NIXL over RDMA on InfiniBand or RoCE), KV-cache-aware request routing, and the orchestration patterns required to operate them at scale
Define the integration model between OXMIQ's Capsule runtime and the open-source inference-serving stack (vLLM, SGLang, TensorRT-LLM, llm-d, NVIDIA Dynamo, Triton Inference Server) so that serving workloads dispatch across heterogeneous silicon as a first-class capability
Partner with the orchestration team on the design of Kubernetes-based scheduling for accelerator fleets, including multi-tenant isolation, GPU and accelerator scheduling, and capacity management, so that the design meets the needs of the inference-serving layer
Partner with the data-center infrastructure team on DC-scale provisioning, OS imaging, firmware, and burn-in validation flows for AI pods running on OXMIQ and third-party hardware, ensuring inference SLAs are achievable on the resulting fleet
Conduct architecture and code reviews and provide technical guidance to engineering leads across inference, orchestration, runtime, security, monitoring, and platform UI
Produce design documents, prototypes, and reference implementations for new platform components
Serve as the technical representative of the platform architecture in selected customer and partner engagements
