Oxmiq Labs is a company focused on designing GPU and AI silicon for large-scale model inference and training. The Architect, AI Cloud Platform, is responsible for the inference-serving architecture, ensuring efficient model loading, scheduling, and routing across heterogeneous hardware to optimize performance and cost.
Responsibilities:
- Own the inference-serving architecture end to end, including model loading, continuous batching, KV-cache management, prefix caching, request routing, and SLA-aware scheduling across heterogeneous accelerators
- Lead the design of disaggregated prefill/decode deployments, including KV-cache transfer (e.g., NIXL over RDMA / InfiniBand / RoCE), KV-cache-aware request routing, and the orchestration patterns required to operate them at scale
- Define the integration model between OXMIQ's Capsule runtime and the open-source inference-serving stack (vLLM, SGLang, TensorRT-LLM, llm-d, NVIDIA Dynamo, Triton Inference Server) so that serving workloads dispatch across heterogeneous silicon as a first-class capability
- Partner with the orchestration team on the design of Kubernetes-based scheduling for accelerator fleets, including multi-tenant isolation, GPU and accelerator scheduling, and capacity management, so that the scheduling design meets the needs of the inference-serving layer
- Partner with the data-center infrastructure team on DC-scale provisioning, OS imaging, firmware, and burn-in validation flows for AI pods running on OXMIQ and third-party hardware, ensuring inference SLAs are achievable on the resulting fleet
- Conduct architecture and code reviews and provide technical guidance to engineering leads across inference, orchestration, runtime, security, monitoring, and platform UI
- Produce design documents, prototypes, and reference implementations for new platform components
- Serve as the technical representative of the platform architecture in selected customer and partner engagements