24-MAG is offering a specialized part-time consulting opportunity for experienced software engineering, data science, and systems design professionals. The role involves evaluating AI-generated coding output, validating its technical accuracy, and improving how AI systems reason about code.
Responsibilities:
- Evaluate LLM-generated responses to coding and software engineering queries for accuracy, reasoning, clarity, and completeness
- Assess model responses across programming, data science, and systems design tasks of varying complexity
- Ensure model outputs align with expected conversational behavior and system guidelines
- Conduct fact-checking using trusted public sources and authoritative references
- Execute code and validate outputs using appropriate tools to test correctness and reliability
- Assess code quality, readability, algorithmic soundness, and explanation quality
- Annotate model responses by identifying strengths, weaknesses, and factual or conceptual inaccuracies
- Identify subtle bugs, logical flaws, inefficiencies, edge cases, and misleading explanations
- Apply consistent evaluation standards using defined taxonomies, benchmarks, and detailed evaluation guidelines
- Produce reproducible evaluation artifacts that help improve model performance and reliability
Requirements:
- A BS, MS, or PhD in Computer Science or a closely related field
- 5+ years of real-world experience in software engineering, data science, systems design, or related technical roles
- Expertise in at least two relevant languages or technologies, such as Python, Java, C++, C, JavaScript, Go, Rust, Ruby, SQL, PowerShell, Bash, Swift, Kotlin, R, TypeScript, or HTML/CSS
- The ability to independently solve HackerRank or LeetCode medium- and hard-level problems
- Experience contributing to well-known open-source projects, including merged pull requests
- Significant experience using LLMs while coding and a strong understanding of their strengths and failure modes
- Strong attention to detail and comfort evaluating complex technical reasoning and subtle implementation flaws
- Fluency in written and spoken English
- Prior experience with RLHF, model evaluation, or data annotation work
- Track record in competitive programming
- Experience reviewing code in production environments
- Familiarity with multiple programming paradigms or technical ecosystems
- Ability to explain complex technical concepts clearly to non-expert audiences