Hyphen Connect is seeking a talented and innovative Synthetic Data Engineer. In this role, you will design and implement domain-specific synthetic data generation pipelines and manage the high-quality data that feeds model training loops.
Responsibilities:
- Design domain-specific synthetic data generation (SDG) pipelines via self-instruct and constitutional prompting
- Implement automated quality scoring and de-duplication systems
- Manage data pipelines that feed directly into supervised fine-tuning (SFT) and direct preference optimization (DPO) training loops
Requirements:
- Proven experience building large-scale data pipelines (Airflow, Spark, Ray)
- Deep knowledge of prompt engineering for data generation
- Familiarity with dataset distillation and bias mitigation