Kake is a leading expert human data platform for AI agents and LLMs. We are seeking a Senior Software Engineer to contribute to the development and evaluation of AI training data. In this role, you will use your technical expertise to write prompts, produce reference-quality code solutions, and evaluate AI-generated outputs to improve AI systems.
Responsibilities:
- Create and review coding tasks based on real-world software engineering scenarios, including debugging, refactoring, code generation, API usage, automated tests, performance, security, and edge cases
- Write high-quality reference solutions that are correct, clear, testable, and aligned with task requirements
- Evaluate AI-generated code and responses using structured rubrics, assessing correctness, clarity, security, performance, maintainability, and instruction-following
- Compare multiple model responses, select the strongest answer, and justify your decision with clear technical reasoning
- Identify bugs, hallucinated APIs, missing edge cases, weak explanations, and poor engineering decisions in AI-generated outputs
- Work with terminal-based development workflows when needed, including running tests, debugging issues, managing dependencies, and navigating repositories
- Follow detailed guidelines consistently and participate in calibration activities to ensure high-quality, reliable evaluations
Requirements:
- 5+ years of professional software engineering experience in a backend, full-stack, or systems role
- Strong proficiency in at least one core programming language, ideally Python, JavaScript/TypeScript, Go, Java, C++, or SQL
- Hands-on experience with Terminal-Bench and the ability to evaluate AI agent performance on terminal-based tasks, including compiling code, running tests, managing environments, and completing multi-step software engineering workflows
- Comfortable working with Git, command line/terminal, and common development workflows
- Ability to evaluate code critically: not only whether it works, but whether it is well-designed, secure, and maintainable
- Prior experience in AI data production, RLHF, data annotation, or LLM evaluation projects
- Excellent written and verbal communication skills in English
- Ability to work independently in a remote, asynchronous, fast-paced environment
- High attention to detail and the ability to follow complex, rubric-based guidelines consistently
- Experience with Python-heavy workflows, automated testing frameworks, Docker, Linux, bash, or containerized environments
- Experience with repo-level code reasoning, large codebases, or open-source contributions
- Background in backend systems, data engineering, DevOps, infrastructure, security, or large codebases