Troveo AI is the largest licensable video library for AI model training, partnering with various content licensors to supply video content for research labs. The Principal or Senior Data Engineer will focus on developing scalable data pipelines and optimizing systems for handling petabyte-scale datasets, ensuring high performance and reliability.
Responsibilities:
- Design, build, and maintain scalable, efficient data pipelines in Python
- Leverage AWS services such as S3 for data storage and retrieval (including multiple storage tiers) and EC2 for compute and processing (currently running clusters of 50k G instances) in production environments
- Develop and optimize systems to handle petabyte-scale datasets with a focus on performance, reliability, and cost-effectiveness
- Leverage self-hosted open-source LLMs and managed APIs to generate reliable metadata that powers discovery and enhances the value of the content we deliver
- Build search capabilities from the ground up, leveraging visual, semantic, and taxonomic data to deliver the right content to our customers
- Implement robust monitoring, alerting, and logging to ensure smooth data flow and quickly troubleshoot issues
- Work cross-functionally with data scientists, software engineers, and product teams to understand data needs and deliver optimized solutions
- If applicable, process and manage video data for analytics, quality control, and other use cases
Requirements:
- Strong coding skills in Python (including familiarity with libraries for data manipulation and analysis)
- Hands-on experience using core AWS services (S3, EC2, possibly Lambda, EMR, or ECS)
- Demonstrated ability to work with large-scale datasets (petabyte-level), ensuring high performance and scalability
- Familiarity with large Postgres databases
- Comfortable building CI/CD pipelines and automating repetitive tasks
- Experience handling or transforming video data (e.g., transcoding, extracting metadata, compiling FFmpeg)
- Familiarity with ML and Computer Vision workflows or frameworks (OpenCV, TensorFlow, PyTorch, etc.)
- Understanding of AWS IAM, encryption, and SOC 2 compliance standards