Grid Dynamics, a leading provider of technology consulting and advanced analytics services, is seeking a highly skilled Machine Learning Engineer specializing in Large Language Models (LLMs) for automated evaluation and quality assessment. The role involves designing systems that measure and improve model outputs, leading initiatives to build evaluation pipelines, and collaborating with cross-functional teams to enhance product reliability and user experience.
Responsibilities:
- Design and implement automated systems and pipelines for evaluating LLM outputs
- Develop metrics and KPIs to measure output quality, accuracy, and consistency using LLM-based evaluations
- Collaborate with Engineering teams to create automated logic checks and validation tools
- Partner with Data Scientists to analyze evaluation results and optimize prompt and task structures
- Maintain feedback loops so that evaluation guidelines stay aligned with LLM-based assessments
- Investigate how LLM-derived evaluations can enhance product reliability and user experience
- Recommend refinements to prompt engineering, evaluation strategies, and automation tools
- Stay informed on emerging trends in LLM evaluation, automated quality assessment, and AI toolchains
- Continuously improve and expand automated evaluation processes based on industry best practices
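The first few responsibilities above can be pictured as a small evaluation harness: rule-based logic checks plus an LLM-as-judge score, aggregated into KPIs. This is a minimal sketch only; `stub_llm_judge` is a hypothetical stand-in for a real model call, and all metric and function names are illustrative assumptions, not an actual Grid Dynamics system:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    output: str       # model response under evaluation
    reference: str    # expected answer, when one exists

def logic_check_valid_json(case: EvalCase) -> bool:
    """Automated logic check: the output must parse as JSON."""
    try:
        json.loads(case.output)
        return True
    except json.JSONDecodeError:
        return False

def stub_llm_judge(case: EvalCase) -> float:
    """Hypothetical stand-in for an LLM-as-judge call.
    A real system would send a rubric prompt to a model and parse a
    numeric score; here we crudely score by word overlap with the
    reference so the sketch stays self-contained."""
    out = set(case.output.lower().split())
    ref = set(case.reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def run_pipeline(cases: list[EvalCase],
                 checks: list[Callable[[EvalCase], bool]],
                 judge: Callable[[EvalCase], float]) -> dict:
    """Apply every logic check and the judge, then roll results up into KPIs."""
    results = [
        {"passed_checks": all(chk(c) for chk in checks),
         "judge_score": judge(c)}
        for c in cases
    ]
    n = len(results)
    return {
        "check_pass_rate": sum(r["passed_checks"] for r in results) / n,
        "mean_judge_score": sum(r["judge_score"] for r in results) / n,
    }

cases = [
    EvalCase('Return {"a":1} as JSON', '{"a": 1}', '{"a": 1}'),
    EvalCase('Return {"b":2} as JSON', 'not json', '{"b": 2}'),
]
kpis = run_pipeline(cases, [logic_check_valid_json], stub_llm_judge)
print(kpis)  # → {'check_pass_rate': 0.5, 'mean_judge_score': 0.5}
```

In practice the judge stub would be swapped for a rubric-prompted model call, and the KPI dictionary would feed dashboards and regression gates.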
Requirements:
- 5+ years of experience in ML engineering, NLP, or AI/ML automation
- Advanced degree (MS/PhD) in Statistics, Data Science, Computational Social Science, Quantitative Psychology, or a related field
- Strong understanding of machine learning principles with focus on NLP and advanced LLM capabilities (e.g., Chain-of-Thought, agentic workflows)
- Expertise in building automated evaluation or QA pipelines
- Excellent analytical and problem-solving skills with experience in root cause and error pattern analysis
- Proven project management and cross-functional collaboration experience
- Excellent communication skills to convey complex insights to technical and non-technical audiences
- Detail-oriented mindset with a focus on evaluation metrics, prompt design, and automation
- Ability to quickly adapt to new business rules and evaluation guidelines across diverse product domains
- Strong programming skills in Python and SQL
- Hands-on experience in prompt engineering and designing LLM-based evaluation systems
- Experience with big data technologies such as PySpark for data aggregation and sampling
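The last requirement typically means per-stratum sampling of evaluation data, which in PySpark is `DataFrame.sampleBy(col, fractions, seed)`. Below is a plain-Python sketch of the same idea (so it runs without a Spark cluster); the field names and sampling rates are illustrative assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(rows: list[dict], key: str,
                      fractions: dict, seed: int = 7) -> list[dict]:
    """Sample each stratum at its own rate, mirroring what PySpark's
    DataFrame.sampleBy(key, fractions, seed) does at scale."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    sample = []
    for stratum, members in by_stratum.items():
        frac = fractions.get(stratum, 0.0)  # strata absent from `fractions` are dropped
        k = round(len(members) * frac)
        sample.extend(rng.sample(members, k))
    return sample

# Illustrative evaluation log: keep every rare "fail" case,
# but only 5% of the abundant "pass" cases.
rows = [{"id": i, "verdict": "pass" if i % 10 else "fail"} for i in range(1000)]
picked = stratified_sample(rows, "verdict", {"pass": 0.05, "fail": 1.0})
```

Oversampling the rare failure stratum like this keeps review budgets focused on the cases most likely to reveal error patterns.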