Architect and maintain the "Reddit Benchmark" evaluation suite: A comprehensive harness that rigorously tests model capabilities across Safety, Reasoning, and Reddit-specific knowledge (slang, norms).
Build scalable SFT (Supervised Fine-Tuning) pipelines: Implement efficient, distributed training loops for instruction tuning, converting raw base models into helpful assistants.
Develop Model-as-a-Judge systems: Engineer automated evaluation pipelines using strong models (e.g., GPT-5, Nova, Claude) to grade the outputs of our internal models, enabling rapid iteration cycles.
Execute Synthetic Data generation strategies: Create and curate high-quality instruction sets to improve model generalization where human data is scarce.
Collaborate with Safety Engineering: Translate high-level safety policies into concrete evaluation metrics and unit tests that run in our CI/CD pipelines.
Debug post-training instability: Dive deep into loss curves and evaluation logs to identify when fine-tuning introduces an alignment tax or capability degradation.
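To illustrate the model-as-a-judge pattern described above, here is a minimal sketch of an automated grading loop. The prompt template, score scale, and the `judge_fn` callable are hypothetical placeholders standing in for a real API client, not Reddit's actual pipeline:

```python
import re

# Hypothetical rubric prompt; a production system would version and test this.
JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness from 1 (poor) to 5 (excellent).
Reply with a single line: Score: <number>"""

def parse_score(judge_reply: str) -> int:
    """Extract the numeric grade from the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))

def grade_outputs(examples, judge_fn):
    """Grade (question, answer) pairs with a judge callable; return the mean score."""
    scores = [
        parse_score(judge_fn(JUDGE_PROMPT.format(question=q, answer=a)))
        for q, a in examples
    ]
    return sum(scores) / len(scores)

# Usage with a stub judge (a real pipeline would call a strong model's API here):
stub_judge = lambda prompt: "Score: 4"
print(grade_outputs([("What is 2+2?", "4")], stub_judge))  # → 4.0
```

Keeping the judge behind a plain callable makes it easy to swap grading models and to unit-test the parsing logic without network calls.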
Requirements
4+ years of professional experience in machine learning engineering, with a focus on LLM fine-tuning or evaluation.
Fluency in Python and PyTorch, with experience using libraries like Hugging Face Transformers, vLLM, or lm-eval-harness.
Deep understanding of Instruction Tuning (SFT) and how data quality impacts model behavior.
Experience building Evaluation Pipelines: You know what benchmarks like MMLU and GSM8K measure, and how to build a custom domain-specific benchmark.
Familiarity with distributed training (FSDP/DeepSpeed) for fine-tuning jobs.
Strong data engineering skills for curating and cleaning instruction datasets.
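As a small sketch of the instruction-dataset curation work mentioned above, the following filter deduplicates records and drops low-quality pairs. The record shape (`instruction`/`response` keys) and the length threshold are illustrative assumptions, not a prescribed schema:

```python
import hashlib

def clean_instruction_data(records, min_response_chars=20):
    """Deduplicate and filter raw instruction/response pairs.

    Drops records with empty instructions, responses shorter than
    min_response_chars, and exact duplicates (matched by content hash).
    """
    seen = set()
    cleaned = []
    for rec in records:
        instruction = rec.get("instruction", "").strip()
        response = rec.get("response", "").strip()
        if not instruction or len(response) < min_response_chars:
            continue
        key = hashlib.sha256(f"{instruction}\x00{response}".encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned
```

In practice this exact-match pass would be followed by fuzzy deduplication (e.g. MinHash) and model-based quality scoring, but the hash-and-filter skeleton stays the same.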
Tech Stack
Python
PyTorch
Benefits
Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support