Sparibis is seeking a Data Engineer to join their team. The role involves planning, creating, and maintaining data architectures, ensuring alignment with business requirements while optimizing data processes and automating tasks where possible.
Responsibilities:
- Plan, create, and maintain data architectures, ensuring alignment with business requirements
- Obtain data, formulate dataset processes, and store optimized data
- Identify problems and inefficiencies and apply solutions
- Determine tasks where manual participation can be eliminated with automation
- Identify and optimize data bottlenecks, leveraging automation where possible
- Create and manage data lifecycle policies (retention, backup/restore, etc.)
- Apply in-depth knowledge to create, maintain, and manage ETL/ELT pipelines
- Create, maintain, and manage data transformations
- Maintain/update documentation
- Create, maintain, and manage data pipeline schedules
- Monitor data pipelines
- Create, maintain, and manage data quality gates (Great Expectations) to ensure high data quality
- Support AI/ML teams with optimizing feature engineering code
- Apply expertise in Spark/Python/Databricks, Data Lake, and SQL
- Create, maintain, and manage Spark Structured Streaming jobs, including using the newer Delta Live Tables and/or dbt
- Research existing data in the data lake to determine best sources for data
- Create, manage, and maintain ksqlDB and Kafka Streams queries/code
- Perform data-driven testing for data quality
- Maintain and update Python-based data processing scripts executed on AWS Lambdas
- Write unit tests for all Spark, Python data processing, and Lambda code
- Maintain and optimize the PCIS Reporting Database data lake (performance tuning, etc.)
- Streamline data processing, including formalizing concepts of how to handle late data, defining windows, and understanding how window definitions impact data freshness
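To illustrate the data quality gate responsibility above, here is a minimal stdlib-Python sketch of the checkpoint idea behind tools like Great Expectations (the expectation names, record fields, and threshold are hypothetical, and this is not the Great Expectations API):

```python
from datetime import datetime

# Hypothetical expectations: (name, predicate) pairs applied to every record.
EXPECTATIONS = [
    ("id_not_null", lambda r: r.get("id") is not None),
    ("amount_non_negative", lambda r: r.get("amount", 0) >= 0),
    ("ts_is_datetime", lambda r: isinstance(r.get("ts"), datetime)),
]

def quality_gate(records, expectations=EXPECTATIONS, threshold=1.0):
    """Return (passed, results), where results maps expectation name -> pass rate.

    The gate passes only if every expectation's pass rate meets the threshold.
    """
    results = {}
    for name, predicate in expectations:
        passed = sum(1 for r in records if predicate(r))
        results[name] = passed / len(records) if records else 1.0
    return all(rate >= threshold for rate in results.values()), results

good = [{"id": 1, "amount": 9.5, "ts": datetime(2024, 1, 1)}]
bad = good + [{"id": None, "amount": -2.0, "ts": "2024-01-01"}]
```

A real quality gate would typically run between pipeline stages and fail the run (or quarantine rows) when the gate does not pass.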
Requirements:
- 5+ years of professional experience
- Bachelor's degree in an IT-related field
- Applicants must be able to obtain and maintain a Secret security clearance; United States citizenship is required to be eligible for this type of clearance
- CompTIA Security+
- 5+ years of IT experience focusing on enterprise data architecture and management to include data flow charts, diagrams, and other technical documentation
- Experience with Databricks, Structured Streaming, Delta Lake concepts, and Delta Live Tables required
- Python development experience required
- Experience with ETL and ELT tools such as SSIS, Pentaho, and/or Data Migration Services, and the ability to incorporate Python as required
- Advanced level SQL experience (Joins, Aggregation, Windowing functions, Common Table Expressions, RDBMS schema design, Postgres performance optimization)
- Proficiency using Git for version control, including repository management, branching, merging, and pull requests
- Must have an active Secret security clearance
- Experience in Conceptual/Logical/Physical Data Modeling & expertise in Relational and Dimensional Data Modeling
- Knowledge of Python (Python 3.X) for CI/CD pipelines required
- Advanced level understanding of streaming data pipelines and how they differ from batch systems
- Advanced understanding of ETL and ELT and of ETL/ELT tools such as SSIS, Pentaho, Data Migration Service, etc.
- Understanding of concepts and implementation strategies for different incremental data loads such as tumbling window, sliding window, high watermark, etc
- Debug, troubleshoot, design and implement solutions to complex technical issues
- Experience with large-scale, high-performance enterprise big data application deployment and solution
- Understanding how to create DAGs to define workflows
- Ability to thrive in a team-based environment
- Experience briefing the benefits and constraints of technology solutions to technology partners, stakeholders, team members, and senior management
- Git repository setup and management
- Branching strategies (feature, develop, main)
- Merging and resolving conflicts
- Creating and reviewing pull requests
- Commit best practices (clear messages, atomic commits)
- Tagging and release management
- Active CompTIA Security+ certification preferred. If selected, must be able to obtain a CompTIA Security+ certification before beginning support of the program
- Additional experience with Spark, Spark SQL, Spark DataFrames and DataSets, and PySpark
- Structured Streaming and Delta Live Tables with Databricks a bonus
- Familiarity with Pytest and Unittest a bonus
- Familiarity and/or expertise with Great Expectations or other data quality/data validation frameworks a bonus
- Familiarity with concepts such as late data, defining windows, and how window definitions impact data freshness
- Familiarity with CI/CD pipelines, containerization, and pipeline orchestration tools such as Airflow, Prefect, etc a bonus but not required
- Architecture experience in AWS environment a bonus
- Familiarity with Kinesis and/or Lambda a bonus, specifically how to push and pull data, how to use AWS tools to view data in Kinesis streams, and how to process massive data at scale
- Experience with Docker, Jenkins, and CloudWatch
- Experience working with AWS Lambdas for configuration and optimization
- Experience working with DynamoDB to query and write data
- Experience with S3
- Experience working with JSON and defining JSON Schemas a bonus
- Experience setting up and managing Confluent/Kafka topics and ensuring Kafka performance a bonus
- Familiarity with Schema Registry and serialization formats such as Avro, ORC, etc.
- Understanding of how to manage ksqlDB SQL files and migrations, and of Kafka Streams
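The advanced SQL items above (joins, aggregation, windowing functions, CTEs) can be sketched with Python's built-in sqlite3 module; the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'alice', 10.0), (2, 'alice', 30.0),
  (3, 'bob', 20.0), (4, 'bob', 5.0);
""")

query = """
WITH totals AS (                      -- Common Table Expression
  SELECT customer, SUM(amount) AS total   -- aggregation
  FROM orders GROUP BY customer
)
SELECT o.customer,
       o.amount,
       t.total,
       RANK() OVER (PARTITION BY o.customer
                    ORDER BY o.amount DESC) AS rnk   -- windowing function
FROM orders o
JOIN totals t ON o.customer = t.customer             -- join
ORDER BY o.customer, rnk;
"""
rows = conn.execute(query).fetchall()
```

The same patterns carry over to Postgres and Spark SQL, where window functions and CTEs are central to performance-sensitive analytics queries.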
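The high watermark strategy mentioned among the incremental load patterns can be sketched in a few lines of stdlib Python (the row shape and field names are hypothetical): only rows newer than the stored watermark are loaded, and the watermark advances after each batch.

```python
from datetime import datetime

def incremental_load(source_rows, watermark):
    """Select only rows newer than the stored high watermark,
    then advance the watermark to the newest row seen."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]
# First run: a watermark in the past picks up everything.
batch, wm = incremental_load(rows, datetime(2023, 12, 31))
# A new row arrives; the second run loads only rows past the advanced watermark.
rows.append({"id": 4, "updated_at": datetime(2024, 1, 4)})
batch2, wm2 = incremental_load(rows, wm)
```

In production the watermark would be persisted (e.g. in a control table) so that each scheduled run resumes where the last one stopped.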
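The tumbling window and late data concepts listed above can be illustrated with a small stdlib sketch (window width, lateness allowance, and event fields are hypothetical): events are bucketed into fixed 5-minute windows, and an event is dropped as late when its window closed before the watermark (max event time seen minus the allowed lateness).

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def window_start(ts, width=WINDOW):
    """Floor a timestamp to the start of its tumbling window."""
    epoch = datetime(1970, 1, 1)
    return ts - (ts - epoch) % width

def window_assign(events, allowed_lateness=timedelta(minutes=1)):
    """Bucket events into tumbling windows; drop events whose window
    already closed relative to the watermark."""
    watermark = max(e["ts"] for e in events) - allowed_lateness
    windows, dropped = {}, []
    for e in events:
        start = window_start(e["ts"])
        if start + WINDOW <= watermark:  # window closed: event is late
            dropped.append(e)
        else:
            windows.setdefault(start, []).append(e)
    return windows, dropped

events = [
    {"id": "a", "ts": datetime(2024, 1, 1, 12, 0, 30)},  # 12:00-12:05 window
    {"id": "b", "ts": datetime(2024, 1, 1, 12, 14, 0)},  # moves watermark to 12:13
]
windows, dropped = window_assign(events)
```

This also shows the freshness trade-off the posting alludes to: a wider window or a longer lateness allowance admits more late data but delays when results can be emitted as final.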
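Finally, the requirement to understand DAG-defined workflows can be sketched without any orchestrator: Python's stdlib graphlib resolves a dependency graph into a valid execution order, which is essentially what Airflow or Prefect schedulers do. The task names below are invented for illustration.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
    "report": {"load", "validate"},
}

# static_order() yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
```

In a real orchestrator, tasks with no unmet dependencies can also run in parallel; graphlib exposes that via its prepare()/get_ready() API as well.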