Implement and maintain end-to-end pipelines for data acquisition from diverse sources, including databases, APIs, files, and messaging systems such as Kafka.
Build robust data validation, enrichment, and transformation workflows using Python and PySpark.
Develop and optimize data storage and querying layers using technologies such as Apache Iceberg, Trino, StarRocks, and Snowflake.
Implement and maintain dimensional data models, including star and snowflake schemas, as defined by data architecture standards.
Integrate and manage streaming data flows using Kafka for both ingestion and real-time data distribution.
Design and implement data quality checks, monitoring, and alerting to ensure high data reliability.
Contribute to metadata management, data governance, and security practices, including access controls and data masking.
Enable data distribution and consumption through files, APIs, Kafka, Snowflake data sharing, and analytics tools.
Optimize pipeline performance, cost, and scalability, and troubleshoot and resolve production issues.
Collaborate closely with data architects, analysts, data scientists, and stakeholders to deliver high-quality data products.
Mentor junior engineers and promote best practices in code quality, testing, and CI/CD for data pipelines.
Requirements
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
5+ years of hands-on experience in data engineering roles, including at least 2 years working with big data or lakehouse platforms.
Strong proficiency in Python and PySpark for building scalable data processing pipelines.
Hands-on experience with analytical and query platforms such as Trino, StarRocks, and Snowflake.
Experience working with open table formats, particularly Apache Iceberg.
Proven experience with streaming technologies, especially Apache Kafka.
Solid understanding of dimensional modeling and data warehousing concepts.
Familiarity with data quality frameworks, metadata management, governance tools, and security best practices.
Experience with cloud platforms such as AWS, Azure, or GCP, and infrastructure-as-code tools.
Strong problem-solving skills with experience debugging and tuning complex data pipelines.