NTT DATA is a leading business and technology services provider seeking a Data Engineer - AWS to join its team. The role focuses on designing, building, and maintaining scalable data pipelines and platforms that support analytics and operational decision-making, primarily ingesting data from Salesforce into AWS services such as Amazon S3 and Amazon Redshift.
Responsibilities:
- Build and maintain pipelines that extract data from Salesforce (API-based or connector-based), land data in Amazon S3, and load it into Amazon Redshift (see the pipeline sketch after this list)
- Implement incremental loads / CDC patterns where applicable; manage full loads and historical backfills as needed
- Establish scheduling and orchestration for daily/near-real-time jobs with reliability and retry mechanisms
- Design, develop, and optimize complex SQL in Oracle
- Analyze and convert Oracle SQL to Redshift-compatible SQL, optimizing for Redshift performance and cost
- Tune Redshift queries using best practices such as sort keys, distribution styles, and query patterns (see the DDL sketch after this list)
- Design and maintain ETL/ELT jobs, transformations, and reusable frameworks
- Build and optimize data models for warehousing/lakehouse patterns (facts/dimensions, curated layers)
- Support both batch and (where applicable) near-real-time processing patterns
- Implement data quality checks (completeness, accuracy, consistency), reconciliation, and validation rules (see the reconciliation sketch after this list)
- Ensure data integrity, metadata documentation, lineage, and governance practices
- Apply security and compliance standards (GDPR/regulatory needs where applicable)
- Monitor pipelines and infrastructure using AWS monitoring tools; troubleshoot performance and reliability issues
- Improve pipeline resilience through alerting, logging, retries, and error handling
- Contribute to modernization and cloud migration initiatives and automation (DataOps/CI-CD where relevant)
- Partner with analytics/reporting and business stakeholders to gather requirements and deliver reliable datasets
- Work effectively with cross-functional teams and provide clear documentation of pipelines and datasets
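As a rough illustration of the core pipeline responsibility above, the sketch below shows one way an incremental Salesforce-to-S3-to-Redshift load could be wired together with the simple-salesforce and boto3 libraries. The bucket, cluster, IAM role, credentials, and watermark value are all hypothetical placeholders, not details from the posting.

```python
"""Minimal sketch of an incremental Salesforce -> S3 -> Redshift load.

Every name below (bucket, cluster, IAM role, credentials, watermark) is a
hypothetical placeholder used only for illustration.
"""
import csv
import io

import boto3
from simple_salesforce import Salesforce

BUCKET = "my-landing-bucket"                                      # hypothetical landing bucket
COPY_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"   # hypothetical IAM role


def extract_accounts(sf: Salesforce, watermark: str) -> list[dict]:
    """Pull only rows modified since the last run (incremental / CDC-style)."""
    soql = (
        "SELECT Id, Name, SystemModstamp FROM Account "
        f"WHERE SystemModstamp > {watermark}"
    )
    return sf.query_all(soql)["records"]


def land_in_s3(rows: list[dict], key: str) -> None:
    """Write the extract to S3 as CSV so a Redshift COPY can pick it up."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["Id", "Name", "SystemModstamp"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())


def load_into_redshift(key: str) -> None:
    """Issue a COPY through the Redshift Data API (no persistent connection needed)."""
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster
        Database="analytics",
        DbUser="etl_user",
        Sql=(
            f"COPY staging.sf_account FROM 's3://{BUCKET}/{key}' "
            f"IAM_ROLE '{COPY_ROLE}' FORMAT AS CSV IGNOREHEADER 1;"
        ),
    )


if __name__ == "__main__":
    sf = Salesforce(username="user", password="pw", security_token="token")  # hypothetical creds
    rows = extract_accounts(sf, watermark="2024-01-01T00:00:00Z")            # last-run watermark
    key = "salesforce/account/2024-01-01.csv"
    land_in_s3(rows, key)
    load_into_redshift(key)
```

In practice the extraction could equally go through a managed connector (for example AWS AppFlow or a Glue connector) rather than hand-written API calls, matching the "API-based or connector-based" wording above; full loads and historical backfills would simply drop the watermark filter.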
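The sort-key and distribution-style guidance typically shows up in table DDL like the sketch below, run here through the Redshift Data API. The schema, table, column, and cluster names are invented for illustration; the point is only where DISTSTYLE, DISTKEY, and SORTKEY are declared when converting an Oracle table (which would rely on indexes) to Redshift (which has none).

```python
"""Sketch of Redshift-specific DDL (distribution style + sort key).

Schema, table, and cluster names are hypothetical placeholders.
"""
import boto3

# A fact table distributed on its main join key and sorted on the common
# filter column, so joins to the account dimension stay node-local and
# date-range scans skip blocks.
DDL = """
CREATE TABLE curated.fact_opportunity (
    opportunity_id  VARCHAR(18)   NOT NULL,
    account_id      VARCHAR(18)   NOT NULL,
    amount          NUMERIC(18,2),
    close_date      DATE
)
DISTSTYLE KEY
DISTKEY (account_id)      -- co-locate rows that join to dim_account
SORTKEY (close_date);     -- most queries filter on a date range
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="analytics",
    DbUser="etl_user",
    Sql=DDL,
)
```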
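For the data quality and monitoring responsibilities, a reconciliation step often looks like the sketch below: compare the source extract count with the loaded row count and publish the delta as a custom CloudWatch metric that an alarm can watch. The metric namespace, table, cluster, and counts are illustrative assumptions.

```python
"""Sketch of a row-count reconciliation check with a CloudWatch metric.

Namespace, table, and cluster names are hypothetical placeholders.
"""
import time

import boto3


def loaded_row_count(table: str) -> int:
    """Count rows loaded into Redshift via the Data API, polling until the query finishes."""
    rsd = boto3.client("redshift-data")
    stmt = rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster
        Database="analytics",
        DbUser="etl_user",
        Sql=f"SELECT COUNT(*) FROM {table};",
    )
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    result = rsd.get_statement_result(Id=stmt["Id"])
    return int(result["Records"][0][0]["longValue"])


def publish_reconciliation(source_count: int, target_count: int) -> None:
    """Emit the source/target delta so a CloudWatch alarm can flag drift."""
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataPipelines/SalesforceLoad",   # hypothetical namespace
        MetricData=[{
            "MetricName": "RowCountDelta",
            "Value": abs(source_count - target_count),
            "Unit": "Count",
        }],
    )


if __name__ == "__main__":
    extracted = 12_345                                   # count taken from the Salesforce extract
    loaded = loaded_row_count("staging.sf_account")
    publish_reconciliation(extracted, loaded)
```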
Requirements:
- Strong hands-on experience building ETL/ELT pipelines in cloud environments
- Proven experience integrating Salesforce data into a data platform (extraction, S3 landing, transformation, loading into Redshift)
- AWS: Amazon S3, Redshift, IAM, CloudWatch
- Salesforce Integration: Salesforce APIs / connectors (extraction & ingestion patterns)
- Programming & Querying: Python, SQL
- Oracle: Complex SQL, stored procedures (as needed), performance tuning
- Orchestration/Scheduling: AWS Glue, Lambda, Step Functions, cron-based scheduling (or equivalent); see the orchestration sketch after this list
- ETL tools: Informatica, Talend, Azure Data Factory
- Warehousing: Snowflake, Azure Synapse (plus Redshift as primary)
- Big data: Spark, Hadoop
- Streaming & APIs: Kafka, Event Hub, REST APIs
- DevOps/DataOps: CI/CD for data pipelines, infrastructure-as-code exposure
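As a small illustration of the orchestration and resilience items, the sketch below registers a two-state Step Functions workflow (extract, then load) whose Amazon States Language definition retries each task with exponential backoff. The Lambda ARNs, role ARN, state machine name, and retry settings are placeholders, not details from the posting.

```python
"""Sketch of a Step Functions definition with retries for a daily pipeline run.

All ARNs, names, and retry settings below are hypothetical placeholders.
"""
import json

import boto3

# Amazon States Language: extract from Salesforce, then load into Redshift,
# retrying each step with exponential backoff before failing the execution.
DEFINITION = {
    "StartAt": "ExtractFromSalesforce",
    "States": {
        "ExtractFromSalesforce": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:sf-extract",
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Next": "LoadIntoRedshift",
        },
        "LoadIntoRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:redshift-load",
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="salesforce-daily-load",                            # hypothetical name
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/sfn-pipeline",   # hypothetical role
)
```

An EventBridge schedule (or an equivalent cron-based trigger, as the skills list allows) would typically start this state machine for the daily run, with CloudWatch alarms on failed executions providing the alerting side of the resilience requirements.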