Wizard AI is developing a leading AI Shopping Agent that surfaces top products with exceptional accuracy and quality. The company is seeking a Senior Machine Learning Engineer to oversee the production lifecycle of its ML serving systems, ensuring reliability, efficiency, and scalability in a dynamic production environment.
Responsibilities:
- Own and evolve our multi-engine inference platform, supporting a variety of model types and serving requirements
- Build and improve production ML pipelines that take models from experimentation to reliable, high-throughput serving
- Define and implement model versioning, rollout, rollback, and lifecycle management strategies that ensure reproducibility and operational reliability
- Define and enforce serving-layer SLAs, including latency, availability, GPU utilization, Time-to-First-Token (TTFT), and Inter-Token Latency (ITL)
- Build observability, monitoring, alerting, and operational tooling for production inference systems
- Apply software engineering best practices, including testing, CI/CD integration, and reproducibility across ML workflows
- Optimize inference performance through efficient resource utilization, hardware-aware serving strategies, and cost-conscious infrastructure design
- Ensure ML serving systems are secure, scalable, and operationally resilient
- Partner with ML, Data, Product, and DevOps teams to turn ideas into production systems, driving the technical decisions on serving and scale
Requirements:
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related field, or equivalent practical experience
- 5–8+ years of experience in Software Engineering, ML Engineering, Platform Engineering, or Infrastructure Engineering, with direct ownership of production ML serving systems
- Hands-on experience running an LLM serving engine (vLLM, TGI, TensorRT-LLM, or SGLang) in production under real load, not just using managed or hosted endpoints
- Strong Python skills and software engineering fundamentals, combined with deep systems and infrastructure knowledge
- Experience with cloud platforms such as AWS, GCP, or Azure, and familiarity with ML lifecycle tooling, experimentation platforms, and model registries
- Strong grasp of inference performance (continuous batching, KV-cache and GPU-memory behavior, quantization, and CPU-versus-GPU bottlenecks), with the instinct to profile before tuning
- Experience serving heterogeneous workloads, including LLMs, embedding models, and extraction models, each with distinct latency, throughput, and scaling requirements
- Demonstrated ability to balance latency, throughput, reliability, and infrastructure cost while operating production-scale ML systems
- Experience in high-growth startup environments and comfort operating in fast-moving, evolving technical landscapes