Docker, Kubernetes, Python, PyTorch, AI, ML, LLM, RAG, Agentic, MLOps, MLflow, CI/CD, Communication, Remote Work
About this role
Role Overview
Architect and evolve our multi-agent orchestration platform (currently built on Hermes / Multica)
Design and implement voice AI pipelines — STT (VibeVoice-ASR, Whisper), real-time TTS with streaming (VibeVoice-Realtime), VAD (Silero), SIP/RTP telephony integration — with sub-300 ms end-to-end latency targets
Build and maintain RAG pipelines with retrieval quality measurement, re-ranking, and hybrid search over vector + keyword indexes
Define MCP server architecture and tool-use contracts across internal and third-party integrations
Fine-tune and evaluate LLMs (LoRA, QLoRA, DPO) for domain-specific tasks including customer support, classification, and structured extraction
Own the AI observability stack: Langfuse tracing, span-level LLM call instrumentation, cost tracking, and quality regression alerting
Define and enforce guardrails: hallucination detection, PII redaction, output safety scanning, and rate-limiting across multi-tenant deployments
Build data ingestion, preprocessing, and feature pipelines supporting model training and continual learning
Drive CI/CD for ML: automated eval gating, shadow deployments, canary releases, and rollback triggers
Set architectural standards for AI systems across the group; conduct design reviews and own ADRs for major decisions
Mentor ML engineers and applied scientists; grow the team's capabilities in production AI, not just prototype AI
Collaborate with Product and Commercial teams to translate business problems into ML problem formulations with clear success metrics
Engage with external research partners and track emerging work (arXiv, conference proceedings, open-source releases) to identify signals worth productionizing
Requirements
8+ years in ML Engineering, Applied AI, or Research Engineering with at least 2 years in a lead or staff-level role
Deep, hands-on experience with LLMs in production: fine-tuning, RLHF/DPO, prompt engineering, RAG, and tool use
Fluent in Python and the core ML stack: PyTorch, Transformers (HuggingFace), PEFT/LoRA
Real experience with LLM inference serving — vLLM, TensorRT-LLM, or TGI — in a latency-sensitive production environment
Practical knowledge of agentic frameworks: multi-agent coordination, tool-call orchestration, context/memory management, and observability (Langfuse, Opik, or equivalent)
Experience with speech AI (ASR/TTS pipelines) or real-time audio systems is a strong plus
Solid understanding of MLOps: experiment tracking (MLflow/W&B), model registries, containerization (Docker/Kubernetes), and CI/CD for ML
Awareness of LLM-specific risk: hallucination, prompt injection, data leakage, fairness, and privacy — and how to mitigate them in production
Strong communication skills: you can write a crisp design doc, run a productive architecture review, and explain tradeoffs to a non-technical stakeholder