Theoris Services is assisting its client in the search for a Data/Software Engineer to join a growing team. The role involves designing and optimizing data pipelines, implementing lakehouse architectures, and developing data visualization tools to support scientific data integration and analysis.
Responsibilities:
- Design, build, and optimize scalable data pipelines and ETL/ELT processes to integrate and harmonize scientific data (compounds, assays, experiments) from 30+ heterogeneous sources
- Implement and maintain lakehouse architectures on AWS (S3, Glue, Athena, Iceberg) to support multibillion-record datasets
- Develop federated query capabilities using Trino (or similar distributed engines) for unified access across platforms like PostgreSQL, Snowflake, and others
- Build robust backend services, RESTful APIs, and data services using Python (FastAPI, Flask preferred) to enable seamless data flow and integration with scientific tools (e.g., Benchling, computational chemistry systems, AI/ML endpoints)
- Optimize query and database performance for complex analytical workloads across PostgreSQL, Iceberg, Trino, and other platforms
- Implement caching, indexing, and query tuning techniques to improve response times and scalability as data volumes and user bases grow
- Proactively apply reverse engineering and advanced troubleshooting skills to debug complex data issues, pipeline bottlenecks, application failures, and performance problems
- Monitor systems, identify root causes, and implement fixes for data and application reliability
- Design and develop interactive dashboards, visual analytics, and scientific data visualizations using Power BI and Spotfire (or equivalent tools)
- Create reusable visualization components and data-rich UIs (React/TypeScript preferred) to enable scientists to search, filter, explore, and interpret complex datasets, including dose-response curves, chemical structures, and analytical results
- Translate scientific and engineering data into clear, actionable visual insights for researchers and stakeholders
- Apply best software engineering practices: modular/reusable design, clean code principles, code reviews, comprehensive documentation, and creation of maintainable libraries/services
- Write high-quality unit, integration, and end-to-end tests; use mock data effectively to create reliable automated test cases and ensure code stability
- Implement CI/CD pipelines for automated testing, deployment, and monitoring on AWS (EC2, ECS, Lambda, S3)
- Collaborate on full-stack features from database to frontend, ensuring end-to-end functionality, security (SSO/LDAP), and performance
- Partner with scientists, UX designers, and cross-functional teams to gather requirements, conduct user testing, and iterate on usability
- Implement data validation, quality checks, metadata management, and governance to ensure compliance and accuracy
- Contribute to engineering best practices and foster a culture of quality and scalability
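To make the first responsibility above concrete, here is a minimal sketch of harmonizing compound records from heterogeneous sources into one schema. Everything in it is hypothetical and for illustration only: the `Compound` schema, the two source layouts, the field names, and the unit conventions are assumptions, not details from the posting.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Compound:
    """Unified record after harmonization (hypothetical schema)."""
    compound_id: str
    name: str
    assay_value_nm: float  # potency normalized to nanomolar

def from_source_a(row: dict) -> Compound:
    # Hypothetical source A reports IC50 in micromolar; convert to nM.
    return Compound(
        compound_id=row["cmpd_id"].upper(),
        name=row["label"].strip(),
        assay_value_nm=float(row["ic50_um"]) * 1000.0,
    )

def from_source_b(row: dict) -> Compound:
    # Hypothetical source B already uses nM but nests its identifiers.
    return Compound(
        compound_id=row["meta"]["id"].upper(),
        name=row["meta"]["name"].strip(),
        assay_value_nm=float(row["ic50_nm"]),
    )

def harmonize(
    sources: list[tuple[Callable[[dict], Compound], list[dict]]],
) -> list[Compound]:
    """Apply each source's adapter, then de-duplicate on compound_id,
    keeping the first record seen for each identifier."""
    seen: dict[str, Compound] = {}
    for adapter, rows in sources:
        for row in rows:
            record = adapter(row)
            seen.setdefault(record.compound_id, record)
    return list(seen.values())
```

The adapter-per-source pattern is one common way to keep a pipeline open to new inputs: adding a 31st source means writing one more adapter, not touching the merge logic.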
Requirements:
- Bachelor's degree in Computer Science, Data Engineering, Software Engineering, Information Systems, or a related technical field
- 3+ years of professional experience in data engineering, full-stack development, or closely related roles
- Proven track record of building and delivering production-grade data pipelines, platforms, and/or user-facing scientific applications
- Intermediate to strong proficiency in Python (core for pipelines, backend, and data manipulation with pandas/PySpark); familiarity with JavaScript/TypeScript for frontend
- Hands-on experience creating scalable pipelines, ETL/ELT processes, and distributed processing (Spark, Trino/Presto)
- Deep expertise in relational databases (PostgreSQL), modern warehouses (Snowflake, Redshift), and query engines; strong focus on query performance improvement and optimization
- Practical experience with AWS services (S3, Glue, Athena, Lambda, RDS, EC2/ECS)
- Proven experience with Power BI and Spotfire (or similar) for scientific and analytical dashboards/visualizations
- Strong unit testing skills; experience writing automated tests with mock data for robust coverage
- Git for version control; API design (RESTful); CI/CD; clean code and reusable library development
- Excellent reverse engineering and troubleshooting capabilities for complex data and system issues
- Strong problem-solving skills with attention to detail and commitment to data quality/accuracy
- Ability to work independently and collaboratively in cross-functional, scientific teams
- Excellent communication skills to bridge technical concepts with non-technical stakeholders (scientists, researchers)
- Modern JavaScript/TypeScript frameworks (React preferred), responsive UI development, and component libraries
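As a flavor of the testing-with-mock-data requirement above, here is a small sketch using Python's standard library. The `load_active_assays` function, its `client.fetch(table)` interface, and the row shapes are all hypothetical; the point is only that a mock stands in for the real query engine so the test is fast and deterministic.

```python
import unittest
from unittest.mock import Mock

def load_active_assays(client) -> list[dict]:
    """Fetch assay rows from a data service and keep only active ones.

    `client` is anything exposing fetch(table); in production this might
    wrap a Trino or PostgreSQL connection (hypothetical interface).
    """
    rows = client.fetch("assays")
    return [r for r in rows if r.get("status") == "active"]

class LoadActiveAssaysTest(unittest.TestCase):
    def test_filters_inactive_rows(self):
        # Mock data replaces the real backend for the unit test.
        client = Mock()
        client.fetch.return_value = [
            {"assay_id": 1, "status": "active"},
            {"assay_id": 2, "status": "retired"},
        ]
        result = load_active_assays(client)
        self.assertEqual([r["assay_id"] for r in result], [1])
        client.fetch.assert_called_once_with("assays")
```

Injecting the client as a parameter (rather than constructing it inside the function) is what makes the code mockable in the first place, which is the kind of testable design the posting asks for.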