Pyramid Systems, Inc. is an award-winning technology leader driving digital transformation across federal agencies. They are seeking a Senior Data Engineer who will be responsible for designing and maintaining data architectures, ensuring alignment with business requirements, and optimizing data processing pipelines.
Responsibilities:
- Plan, create, and maintain data architectures, ensuring alignment with business requirements
- Acquire data, design dataset processes, and store data in optimized formats
- Identify problems and inefficiencies and implement solutions
- Identify tasks where manual effort can be eliminated through automation
- Identify and resolve data bottlenecks, leveraging automation where possible
- Create and manage data lifecycle policies (retention, backup/restore, etc.)
- Apply in-depth knowledge to create, maintain, and manage ETL/ELT pipelines
- Create, maintain, and manage data transformations
- Maintain/update documentation
- Create, maintain, and manage data pipeline schedules
- Monitor data pipelines
- Create, maintain, and manage data quality gates (Great Expectations) to ensure high data quality; a minimal quality-gate sketch follows this list
- Support AI/ML teams with optimizing feature engineering code
- Apply expertise in Spark, Python, Databricks, data lakes, and SQL
- Create, maintain, and manage Spark Structured Streaming jobs, including the newer Delta Live Tables and/or dbt; a minimal streaming sketch follows this list
- Research existing data in the data lake to determine best sources for data
- Create, manage, and maintain ksqlDB and Kafka Streams queries/code
- Perform data-driven testing for data quality
- Maintain and update Python-based data processing scripts executed on AWS Lambdas
- Write unit tests for all Spark, Python data processing, and Lambda code
- Maintain and optimize the PCIS Reporting Database data lake (performance tuning, etc.)
- Streamline data processing, including formalizing how to handle late data, how to define windows, and how window definitions impact data freshness
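To illustrate the Structured Streaming, windowing, and late-data concepts above, here is a minimal PySpark sketch of a tumbling-window aggregation with a watermark. It is a sketch only, not a description of this project's pipelines: the Kafka topic, broker address, event schema, and Delta paths are hypothetical placeholders, and the Kafka source assumes the spark-sql-kafka connector is available.

```python
# Minimal sketch only: hourly event counts with a tumbling window and a
# watermark for late data. Topic, broker, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-hourly-counts").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Pull two fields out of the JSON payload (hypothetical event shape).
events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.event_type").alias("event_type"),
    F.get_json_object(F.col("value").cast("string"), "$.event_time")
     .cast("timestamp")
     .alias("event_time"),
)

# The watermark bounds how long late records are accepted; the 1-hour tumbling
# window sets aggregation granularity, which in turn drives data freshness.
hourly_counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 hour"), "event_type")
    .count()
)

# Append-mode write to a Delta table; a finalized window is emitted once the
# watermark passes the window end.
query = (
    hourly_counts.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/events_hourly")
    .start("/tmp/delta/events_hourly")
)
query.awaitTermination()
```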
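A data quality gate, in the Great Expectations sense referenced above, validates a batch against declared expectations and fails the pipeline step when any expectation is not met. The following minimal sketch assumes the legacy (pre-1.0) Great Expectations Pandas API; the claims schema and value bounds are hypothetical placeholders.

```python
# Minimal quality-gate sketch using the legacy (pre-1.0) Great Expectations
# Pandas API; column names and bounds are illustrative placeholders.
import great_expectations as ge
import pandas as pd

batch = pd.DataFrame(
    {"claim_id": ["A1", "A2", "A3"], "amount": [120.0, 89.5, 230.0]}
)

dataset = ge.from_pandas(batch)

# Declare the expectations this batch must satisfy.
dataset.expect_column_values_to_not_be_null("claim_id")
dataset.expect_column_values_to_be_unique("claim_id")
dataset.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

# Gate: validate the batch and fail the pipeline step if anything failed.
results = dataset.validate()
if not results.success:
    raise ValueError(f"Data quality gate failed: {results}")
```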
Requirements:
- 8+ years of IT experience focusing on enterprise data architecture and management
- Experience with Databricks, Structured Streaming, Delta Lake concepts, and Delta Live Tables required
- Experience with ETL and ELT tools such as SSIS, Pentaho, and/or Data Migration Services
- Advanced level SQL experience (Joins, Aggregation, Windowing functions, Common Table Expressions, RDBMS schema design, Postgres performance optimization)
- Must be able to obtain a Public Trust security clearance
- Must be a U.S. citizen
- Bachelor's degree required
- Experience in Conceptual/Logical/Physical Data Modeling & expertise in Relational and Dimensional Data Modeling
- Additional experience with Spark, Spark SQL, Spark DataFrames and DataSets, and PySpark
- Data lake concepts such as time travel, schema evolution, and optimization
- Advanced level understanding of streaming data pipelines and how they differ from batch systems
- Ability to formalize how to handle late data, define windows, and manage data freshness
- Advanced understanding of ETL and ELT and of ETL/ELT tools such as SSIS, Pentaho, Data Migration Service, etc.
- Understanding of concepts and implementation strategies for different incremental data loads, such as tumbling window, sliding window, high watermark, etc.
- Understanding of streaming data pipelines and batch systems
- Indexing and partitioning strategy experience
- Ability to debug, troubleshoot, design, and implement solutions to complex technical issues
- Experience with large-scale, high-performance enterprise big data application deployment and solutions
- Experience leading and architecting enterprise-wide initiatives, specifically system integration, data migration, transformation, data warehouse builds, data mart builds, and data lake implementation/support
- Understanding of how to create DAGs to define workflows (see the Airflow sketch at the end of this list)
- Ability to thrive in a team-based environment
- Experience briefing the benefits and constraints of technology solutions to technology partners, stakeholders, team members, and senior levels of management
- Familiarity and/or expertise with Great Expectations or other data quality/data validation frameworks a bonus
- Familiarity with concepts such as late data, defining windows, and how window definitions impact data freshness
- Familiarity with CI/CD pipelines, containerization, and pipeline orchestration tools such as Airflow, Prefect, etc. a bonus
- Architecture experience in AWS environment a bonus
- Familiarity with Kinesis and/or Lambda, specifically how to push and pull data, how to use AWS tools to view data in Kinesis streams, and how to process massive data at scale, a bonus
- Experience with Docker, Jenkins, and CloudWatch
- Ability to write and maintain Jenkinsfiles for supporting CI/CD pipelines
- Experience working with AWS Lambdas for configuration and optimization
- Experience working with DynamoDB to query and write data
- Experience with S3
- Knowledge of Python (Python 3 desired) for CI/CD pipelines a bonus
- Familiarity with pytest and unittest a bonus
- Experience working with JSON and defining JSON Schemas a bonus
- Experience setting up and managing Confluent/Kafka topics and ensuring Kafka performance a bonus
- Familiarity with Schema Registry and message formats such as Avro, ORC, etc.
- Understanding of how to manage ksqlDB SQL files, migrations, and Kafka Streams
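As an illustration of defining workflows as DAGs and managing pipeline schedules, here is a minimal Airflow sketch (Airflow is only one of the orchestration tools named above, and Airflow 2.4+ is assumed). The DAG id, schedule, and task callables are hypothetical placeholders, not this project's actual tasks.

```python
# Minimal DAG sketch (Airflow 2.4+ assumed); ids, schedule, and callables are
# illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull source data")


def transform():
    print("apply transformations")


def quality_gate():
    print("run data quality checks; raise on failure to stop the run")


with DAG(
    dag_id="daily_reporting_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_quality = PythonOperator(task_id="quality_gate", python_callable=quality_gate)

    # The dependency chain is what defines the DAG's workflow.
    t_extract >> t_transform >> t_quality
```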