About this role

Build automated evaluation pipelines for multimodal AI models.
Benchmark diffusion models, vision systems, and generative workflows.
Validate model checkpoints and detect regressions across versions.
Develop evaluation metrics for realism, consistency, and performance.
Integrate evaluation tooling into CI/CD workflows.
Collaborate with ML researchers and infrastructure teams to ensure production readiness.
Analyze failure modes and propose evaluation strategies.

Degree in Computer Science, AI, Engineering, or a comparable combination of education and practical experience.
Strong programming skills in Python.
Familiarity with object-oriented programming (C++, Java, Python, or similar).
Strong data structures and algorithms fundamentals.
Understanding of machine learning experimentation workflows.
Experience evaluating vision or generative models.
Familiarity with the Hugging Face ecosystem or other open-source ML toolkits.
Experience building automated test frameworks or benchmarking tools.
Knowledge of diffusion models or multimodal architectures.
Experience with data analysis tools (NumPy, Pandas, visualization libraries).

Software Engineer – Model Evaluation, Benchmarking

Key skills