Own the Ray ecosystem end-to-end: manage KubeRay on GKE, tune Ray Core Task/Actor scheduling, operate the Plasma distributed object store, and configure Ray Data for GPU-direct streaming from GCS/S3
Operate distributed training with Ray Train: configure TorchTrainer + DDP/NCCL for multi-node H100 clusters, manage checkpoint lifecycle, implement spot-preemption recovery, and integrate warm-start fine-tuning for retrain pipelines
Build and operate the LLM inference mesh with Ray Serve: compose vLLM (PagedAttention), SGLang (RadixAttention), and NVIDIA Triton (TensorRT/ONNX) as a unified deployment graph with Plasma zero-copy memory sharing
Optimise inference performance: configure fractional GPU allocation, enable continuous batching, implement per-engine autoscaling based on request queue depth, and tune KV-cache block sizes
Design and operate the model routing layer: capability-based, version-based, and tenant-based routing with cost-aware fallback between self-hosted SLMs and cloud LLMs
Build RL training infrastructure: define Flyte workflows for RL pipelines (rollout, reward shaping, policy update, evaluation), integrate Ray RLlib or custom PPO/GRPO loops with Ray Train, and manage replay buffer persistence on GCS
Operate the full model promotion lifecycle: quality gate → integration tests → load tests (k6) → shadow mode → A/B gate → canary (10%→100%) with golden-signal auto-rollback
Operate the retrain pipeline: drift-detection triggers, warm-start retraining, relative quality gates (V2 ≥ V1 − 2%), and an automated Flyte DAG that runs through to canary
Integrate RAG retrieval into the inference mesh: vector similarity search, context assembly, and prompt construction before LLM inference
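For illustration, the relative quality gate in the retrain pipeline (the retrained V2 may score no worse than V1 minus 2%) could be sketched as below. This is a minimal sketch under stated assumptions, not the team's actual gate: it assumes scores are on a 0 to 1 scale and reads "2%" as two absolute percentage points. The function name and the tolerance parameter are hypothetical.

```python
def passes_relative_gate(candidate_score: float,
                         baseline_score: float,
                         tolerance: float = 0.02) -> bool:
    """Return True when the candidate (V2) is at most `tolerance` below the baseline (V1).

    Assumption: both scores are on a 0-1 scale and the tolerance is an
    absolute margin (0.02 == two percentage points), not a relative one.
    """
    return candidate_score >= baseline_score - tolerance


# Hypothetical evaluation scores for the baseline (V1) and two retrained candidates.
v1_score = 0.90
print(passes_relative_gate(0.89, v1_score))  # small regression, inside the margin
print(passes_relative_gate(0.85, v1_score))  # regression larger than the margin
```

If the gate is instead meant as a relative margin, the comparison would become `candidate_score >= baseline_score * (1 - tolerance)`; which reading applies is a decision for the pipeline owner.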
Requirements
Experience in ML engineering, including time in an ML platform or MLOps role
Production Ray depth across Ray Train, Serve, Core, and Data, including debugging real production failures such as NCCL timeouts, Plasma OOM, and Serve autoscaling lag
LLM serving engines: hands-on with vLLM, SGLang, or NVIDIA Triton, having tuned PagedAttention, prefix caching, and continuous batching for latency/throughput targets