Design and run short multi-turn conversations (typically 1–5 turns) that test AI personalization behavior
Create prompts grounded in realistic personal scenarios to evaluate contextual understanding
Review AI responses to determine whether personalization is correctly applied
Check grounding quality to ensure the model does not invent unsupported claims about the user
Evaluate integration quality, confirming that personal signals are used naturally rather than forced or robotic
Compare two responses side-by-side and determine which is more helpful, natural, and relevant
Write clear, structured rationales explaining rankings and referencing specific conversation turns
Verify debug information to confirm that the correct data sources were used
Maintain strict workflow hygiene (including deleting evaluation conversations when required)
Requirements
Strong Polish proficiency (reading & writing required) — Polish is the primary evaluation language
BS/BA degree or equivalent experience in Policy, Law, Ethics, Linguistics, Journalism, Computer Science, or a related analytical field
Strong analytical thinking and ability to assess nuanced AI outputs
Excellent written communication skills with the ability to produce structured evaluation notes
High attention to detail when comparing similar responses
Ability to work independently in a fully remote environment
Reliable desktop/laptop and stable internet connection
Willingness to use your primary personal Google account and enable personal data sources for evaluation purposes