Carnegie Mellon University is a private, global research university known for its innovative contributions to education and research. The Machine Learning Department is seeking a Data Pipeline Engineer to ensure the integrity and reliability of data pipelines, monitor data quality issues, and collaborate with data scientists to implement effective data solutions.
Responsibilities:
- Monitor and maintain the health and efficiency of data pipelines
- Troubleshoot and perform root cause analysis for data discrepancies and pipeline issues
- Communicate with data providers to understand data discrepancies and manage changes in data delivery
- Implement fixes and enhancements to improve data quality and pipeline performance
- Collaborate with data scientists and analysts to understand data needs and implement effective data solutions
- Develop strategies for data validation and quality assurance
- Optimize data flow and collection to improve system efficiency
- Document and manage data pipeline architectures, including maintenance and update protocols
- Use tools such as SQL, version control and CI/CD, containerization, task schedulers, Python frameworks, and cloud services for data pipeline management
- Ensure compliance with data governance and security standards
Requirements:
- Bachelor's Degree required
- Minimum one year of research computing experience required
- Basic Linux use and administration: system layout, file permissions, shell, utilities (syslog, cron), diagnostic tools (ps, htop, grep, lsof)
- Experience with Apache Airflow, preferably version 3.0
- Basic database use, especially with Postgres
- Basic scripting (Python, Bash)
- Team software development (git/GitHub, Jira, code reviews, agile methodologies)
- Data analysis: diagnosing and fixing runtime errors and logic bugs; performing basic growth projections to predict future problems; communicating results
- Required technologies: Python, MySQL/Postgres, Linux, git & GitHub, Apache Airflow
- A combination of education and proven experience from which comparable knowledge is demonstrated may be considered
- Successful completion of a pre-employment background check
Tools and technologies:
- Linux, Ubuntu, Bash, Make
- Apache Airflow
- Python, pandas, Flask, PyPI publishing
- SQL, Postgres
- git, GitHub, GitHub Actions, GitHub Issues
- Docker, Docker Compose
- Elastic, Kibana, Filebeat
- G Suite (Calendar, Mail, Docs, Sheets, Slides, Forms, Apps Script, Groups)
- Jira Software