Assess the data warehouse architecture and requirements.
Map data, transformations, and processes across GCP services (Cloud Storage, BigQuery, Dataproc).
Define the data migration strategy (full load, incremental, CDC); a minimal sketch of an incremental load follows this list.
Develop a data architecture plan on GCP.
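For illustration, an incremental (CDC-style) load can be expressed as a MERGE from a staging delta table into the target, run through the google-cloud-bigquery client. This is only a sketch: the project, dataset, table, and column names are placeholders.

# Minimal sketch of an incremental load: merge a staging delta table into the target.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

merge_sql = """
MERGE `my-project.dwh.orders` AS target
USING `my-project.staging.orders_delta` AS delta
ON target.order_id = delta.order_id
WHEN MATCHED AND delta.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET status = delta.status, updated_at = delta.updated_at
WHEN NOT MATCHED AND delta.op != 'D' THEN
  INSERT (order_id, status, updated_at) VALUES (delta.order_id, delta.status, delta.updated_at)
"""

client.query(merge_sql).result()  # blocks until the merge job finishes

A full load would instead truncate and reload the target table, while a CDC approach would keep the delta table fed from a change stream.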
Data Design and Modeling on GCP:
Design table schemas in BigQuery, considering performance, cost, and scalability.
Define partitioning and clustering strategies for BigQuery (see the sketch after this list).
Model data zones in Cloud Storage (Bronze, Silver, Gold).
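As a sketch of the partitioning and clustering item above, a daily-partitioned, clustered table can be created with the google-cloud-bigquery client roughly as follows. Dataset, table, and column names are placeholders.

# Minimal sketch: a daily-partitioned table clustered by customer for a Silver zone.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.silver.events",  # placeholder table reference
    schema=[
        bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
table.clustering_fields = ["customer_id", "event_id"]

client.create_table(table, exists_ok=True)

Partitioning on the event timestamp keeps scans bounded to the queried date range, and clustering narrows them further for per-customer lookups.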
Development of ELT/ETL Pipelines:
Create data transformation routines using Dataproc (Spark) or Dataflow to load data into BigQuery (a sketch follows this list).
Translate business logic and existing transformations into GCP.
Implement data validation and quality assurance mechanisms.
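A minimal PySpark sketch of such a pipeline, assuming the spark-bigquery connector is available on the Dataproc cluster, is shown below. Bucket, table, and column names are placeholders.

# Minimal sketch: read raw Bronze files, transform, run a quality gate, load into BigQuery.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_elt").getOrCreate()

raw = spark.read.parquet("gs://my-bronze-bucket/orders/")  # placeholder bucket

transformed = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn("order_date", F.to_date("order_ts"))
       .dropDuplicates(["order_id"])
)

# Simple data-quality gate: fail the job if a mandatory key is missing.
if transformed.filter(F.col("customer_id").isNull()).count() > 0:
    raise ValueError("Quality check failed: null customer_id found")

(transformed.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the indirect write method
    .mode("overwrite")
    .save("my-project.silver.orders"))

The quality gate here is deliberately simple; in practice it could be replaced by a dedicated validation framework or DBT tests.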
Performance and Cost Optimization:
Optimize BigQuery queries to reduce costs and improve performance (see the dry-run sketch after this list).
Tune and optimize Spark jobs on Dataproc.
Monitor and optimize GCP resource usage to control costs.
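One common cost-control technique is a BigQuery dry run, which estimates the bytes a query would scan before it is executed. A minimal sketch follows; the table and column names are placeholders.

# Minimal sketch: estimate scanned bytes with a dry run before running the query.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT customer_id, SUM(amount) AS total
FROM `my-project.silver.orders`
WHERE order_date >= '2024-01-01'   -- partition filter limits the scanned bytes
GROUP BY customer_id
"""

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Estimated scan: {job.total_bytes_processed / 1e9:.2f} GB")

Comparing the estimate with and without the partition filter is a quick way to confirm that a query actually prunes partitions.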
Data Security and Governance:
Implement and ensure data security in transit and at rest.
Define and apply IAM policies to control access to data and resources (see the sketch after this list).
Ensure compliance with data governance policies.
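As an example of an IAM policy applied in code, the following sketch grants a hypothetical reporting service account read-only access to a Cloud Storage bucket using the google-cloud-storage client. The bucket and account names are placeholders.

# Minimal sketch: add a read-only IAM binding to a bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-gold-bucket")  # placeholder bucket

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"serviceAccount:reporting@my-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)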
Monitoring and Support:
Troubleshoot performance and functionality issues in data pipelines and GCP resources.
Documentation:
Document the architecture, data pipelines, data models, and operational procedures.
Communication:
Communicate effectively with team members, stakeholders, and other areas of the company.
Ensure that architecture definitions are clearly communicated and reflected in the software components, and support the evolution and quality of the team's deliverables.
Jira / Agile Methodologies:
Be familiar with agile methodologies and their ceremonies, and be proficient with Jira.
Requirements
At least 3 years of proven experience with DBT.
Mastery of:
models (staging, intermediate, marts)
ref() and source()
macros (Jinja)
seeds and snapshots
tests (not_null, unique, custom)
Layered organization:
Staging → Transform → Mart (Data Warehouse)
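For illustration, a layered run can be orchestrated from Python with dbt-core's programmatic invocation API (dbt >= 1.5). This is only a sketch; the layer selectors assume models are organized in staging/intermediate/marts folders.

# Minimal sketch: run and test a dbt project layer by layer.
from dbt.cli.main import dbtRunner

dbt = dbtRunner()

for layer in ("staging", "intermediate", "marts"):
    run = dbt.invoke(["run", "--select", layer])
    if not run.success:
        raise RuntimeError(f"dbt run failed for layer: {layer}")
    # not_null / unique / custom tests defined on the layer's models
    dbt.invoke(["test", "--select", layer])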
Google Cloud Platform (GCP):
BigQuery: Deep expertise in data modeling, query optimization, partitioning, clustering, data loading (streaming and batch), security, and data governance.
Cloud Storage: Experience managing buckets, storage classes, lifecycle policies, access control (IAM), and data security (see the lifecycle sketch after this list).
Dataproc: Skills in provisioning, configuring, and managing Spark/Hadoop clusters, optimizing jobs, and integrating with other GCP services.
Dataflow/Composer/DBT: Knowledge of orchestration and data-processing tools for ELT/ETL pipelines.
Cloud IAM (Identity and Access Management): Implementing security policies and fine-grained access control.
VPC, Networking and Security: Understanding networks, subnets, firewall rules, and cloud security best practices.
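As a sketch of the lifecycle-policy item above, storage classes and retention can be managed with the google-cloud-storage client. The bucket name and retention windows are placeholders.

# Minimal sketch: move older objects to a colder storage class, then delete them.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bronze-bucket")  # placeholder bucket

bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)  # after 90 days
bucket.add_lifecycle_delete_rule(age=365)                        # after 1 year
bucket.patch()  # persists the updated lifecycle configuration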
Programming Languages:
Python and PySpark: Essential for automation scripts, building data pipelines, and integrating with GCP APIs.
SQL (advanced): For BigQuery, DBT, and data transformations.