Keystone Recruitment is working with a global AI research client focused on improving large language models through advanced evaluation and training datasets. The role involves analyzing GitHub repositories, evaluating AI model performance, and potentially mentoring junior engineers.
Responsibilities:
- Analyze and triage GitHub issues across widely used open-source repositories
- Set up and configure repositories, including Dockerization and development environment automation
- Evaluate unit test coverage, quality, and reliability
- Run, modify, and debug real-world codebases locally to assess AI model performance in bug-fixing and implementation tasks
- Collaborate with AI researchers to identify challenging repositories and issue types for LLM evaluation
- Contribute to designing structured, verifiable software engineering tasks
- Potentially lead and mentor junior engineers on repository validation projects
Requirements:
- 5+ years of professional software engineering experience
- Strong expertise in at least one of the following: Python, JavaScript, Java, Go, Rust, C/C++, C#, or Ruby
- Deep understanding of software architecture, debugging, and code quality standards
- Proficiency with Git, Docker, and development pipeline setup
- Ability to navigate and evaluate complex, production-grade codebases
Nice to have:
- Experience contributing to or reviewing open-source projects
- Experience participating in AI/LLM evaluation or research initiatives
- Background in building developer tools, automation systems, or code verification agents
- Experience leading small engineering teams