24-MAG is offering a specialized part-time consulting opportunity for experienced software engineering, data science, and systems design professionals. The role involves evaluating AI-generated coding output, validating its technical accuracy, and improving how AI systems reason about code.
Responsibilities:
- Evaluate LLM-generated responses to coding and software engineering queries for accuracy, reasoning, clarity, and completeness
- Assess model responses across programming, data science, and systems design tasks of varying complexity
- Ensure model outputs align with expected conversational behavior and system guidelines
- Conduct fact-checking using trusted public sources and authoritative references
- Execute code and validate outputs using appropriate tools to test correctness and reliability
- Assess code quality, readability, algorithmic soundness, and explanation quality
- Annotate model responses by identifying strengths, weaknesses, and factual or conceptual inaccuracies
- Identify subtle bugs, logical flaws, inefficiencies, edge cases, and misleading explanations
- Apply consistent evaluation standards using defined taxonomies, benchmarks, and detailed evaluation guidelines
- Produce reproducible evaluation artifacts that help improve model performance and reliability
Requirements:
- A BS, MS, or PhD in Computer Science or a closely related field
- 5+ years of real-world experience in software engineering, data science, systems design, or related technical roles
- Expertise in at least two relevant languages or technologies, such as Python, Java, C++, C, JavaScript, Go, Rust, Ruby, SQL, PowerShell, Bash, Swift, Kotlin, R, TypeScript, or HTML/CSS
- The ability to independently solve HackerRank or LeetCode medium- and hard-level problems
- Experience contributing to well-known open-source projects, including merged pull requests
- Significant experience using LLMs while coding and a strong understanding of their strengths and failure modes
- Strong attention to detail and comfort evaluating complex technical reasoning and subtle implementation flaws
- Fluency in written and spoken English
- Prior experience with RLHF, model evaluation, or data annotation work
- Track record in competitive programming
- Experience reviewing code in production environments
- Familiarity with multiple programming paradigms or technical ecosystems
- Ability to explain complex technical concepts clearly to non-expert audiences