Own the LLMOps pipeline: Evaluation infrastructure, the prompt optimization loop, and the production integration that turns experiments into reliable customer-facing features
Design evaluation strategy per output type: Decide when to use deterministic evals (exact match, schema validation, embeddings) vs. LLM-as-judge, and build the rubrics, test datasets, and human-review loops that make the system trustworthy (a minimal sketch follows this list)
Drive prompt engineering and optimization across all LLM operations in the product: Move from hand-tuned prompts to a measurable, iterative process
Pick the right tool for each problem: Some things are LLM problems, some are embedding + classical NLP problems, and some are deterministic-logic problems
Run the production side of AI features: Observability (Langfuse / LangSmith / similar), cost and latency engineering, and incident response when an LLM feature degrades
Build human-in-the-loop workflows: Review queues, feedback ingestion, and labeling, so that production signal feeds back into evals and prompt iteration
Mentor our AI & Analytics Intern and contribute to how we build the AI team over time
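To make the deterministic-vs-judge split concrete, here is a minimal, purely illustrative sketch: the `jsonschema` validation is a real library call, while `call_llm` is a hypothetical stand-in for whatever model client the stack actually uses.

```python
import json
import jsonschema  # pip install jsonschema

# Deterministic eval: schema validation is cheap, fast, and has zero variance.
OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "risk_level": {"enum": ["low", "medium", "high"]},
        "rationale": {"type": "string"},
    },
    "required": ["risk_level", "rationale"],
}

def deterministic_eval(raw_output: str) -> bool:
    """Pass/fail: does the model output parse and satisfy the contract?"""
    try:
        jsonschema.validate(json.loads(raw_output), OUTPUT_SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

# LLM-as-judge: reserved for qualities no rule can check (grounding, tone).
JUDGE_PROMPT = """Score the ANSWER for factual grounding in the CONTEXT.
Reply with a single integer 1-5 and nothing else.
CONTEXT: {context}
ANSWER: {answer}"""

def llm_judge_eval(context: str, answer: str, call_llm) -> int:
    """`call_llm` is a hypothetical client: prompt string in, completion out."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())
```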
Requirements
3+ years of hands-on experience building and shipping ML/AI systems in production (we care more about what you've shipped than years on a CV)
Have shipped an LLM evaluation or prompt optimization pipeline: not just used LLMs in a project, but owned the loop
Strong hands-on experience with LLM-as-judge, including its variance problems and concrete techniques for controlling them (see the variance-control sketch after this list)
Solid foundation in classical NLP and MLOps: Embeddings, semantic similarity, entity matching, classification, fuzzy matching (see the similarity sketch after this list)
Informed opinions on deterministic vs. LLM-based evals, from experience
Production judgment: You've owned cost and latency tradeoffs, observability, and incident response for an LLM-powered feature. You're familiar with prompt regression and have strategies for managing it (see the regression sketch after this list)
Strong Python
Excellent English communication, written and verbal: We discuss nuanced technical tradeoffs daily with the founding team and customers
Comfort with ambiguity: You can run experiments on real data, build intuition for this domain, and know when to stop iterating
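Variance-control sketch: repeated sampling with median or majority-vote aggregation is one common way to tame judge noise; it is not the only one. `call_llm` remains a hypothetical stand-in for the real model client.

```python
from collections import Counter
from statistics import median

def stable_judge_score(prompt: str, call_llm, n_samples: int = 5) -> float:
    """Sample the judge several times and aggregate, rather than
    trusting a single completion."""
    scores = []
    for _ in range(n_samples):
        reply = call_llm(prompt)
        try:
            scores.append(int(reply.strip()))
        except ValueError:
            continue  # malformed judge output: drop it, don't guess
    if not scores:
        raise RuntimeError("judge returned no parseable scores")
    return median(scores)  # median is robust to the occasional outlier

def majority_label(prompt: str, call_llm, n_samples: int = 5) -> str:
    """Categorical variant: majority vote over repeated judge calls."""
    votes = Counter(call_llm(prompt).strip().lower() for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```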
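Similarity sketch of the classical-NLP side, using rapidfuzz and sentence-transformers; the vendor names and model choice are illustrative only.

```python
from rapidfuzz import fuzz, process, utils               # pip install rapidfuzz
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

# Fuzzy entity matching: cheap string-level matching for names/IDs.
VENDORS = ["Acme GmbH", "Globex Corporation", "Initech AG"]
hit, score, _ = process.extractOne(
    "acme gmbh.", VENDORS,
    scorer=fuzz.token_sort_ratio,
    processor=utils.default_process,  # lowercase + strip punctuation
)
print(hit, score)  # -> "Acme GmbH" with a high score; no LLM call needed

# Semantic similarity: embeddings catch paraphrases that string metrics miss.
model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model
a = model.encode("The policy requires annual access reviews.", convert_to_tensor=True)
b = model.encode("Access rights must be re-certified every year.", convert_to_tensor=True)
print(float(util.cos_sim(a, b)))  # high cosine similarity despite little word overlap
```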
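Regression sketch: replay a pinned test set against a prompt candidate before it ships. The JSONL layout and `call_llm` are assumptions for illustration, not our actual harness.

```python
import json
from pathlib import Path

def run_regression(prompt_template: str, call_llm,
                   dataset_path: str = "eval_cases.jsonl",
                   min_pass_rate: float = 0.95) -> None:
    """Fail loudly if a prompt change drops below the pinned baseline."""
    cases = [json.loads(line)
             for line in Path(dataset_path).read_text().splitlines() if line]
    passed = 0
    for case in cases:
        output = call_llm(prompt_template.format(**case["inputs"]))
        if case["expected"] in output:  # deterministic check per case
            passed += 1
    pass_rate = passed / len(cases)
    assert pass_rate >= min_pass_rate, (
        f"prompt regression: {pass_rate:.0%} < {min_pass_rate:.0%} baseline")
```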
Tech Stack
Python
Benefits
Hands-on ownership of a real AI product used by enterprise customers
Work directly alongside the founding team from day one
Hybrid work model: Munich North, minimum one day per week in the office, otherwise flexible (open to strong candidates elsewhere in the EU for the right fit); onboarding will take place in the office
A steep learning curve at the intersection of LLM engineering, enterprise GRC, and startup operations