NTT DATA is a leading business and technology services provider, committed to accelerating client success through responsible innovation. They are seeking an AI Data Engineer - Security with strong Kafka experience to design and operate large-scale event-streaming platforms, manage data ingestion, and optimize AWS Glue jobs.
Responsibilities:
- Design and operate large-scale, highly available Kafka event-streaming platforms, including partitioning strategies, consumer group optimization, schema management, and performance tuning
- Build API-first ingestion pipelines that pull data from REST/GraphQL APIs, handling auth (OAuth2, API keys), pagination, rate limits, retries/backoff, and webhooks; use Python to normalize/enrich data and land it cleanly in S3 (schema, partitioning, Parquet)
- Build and operate S3-based data lakes end to end, with layered zones (raw → harmonized → conformed → modeled), Glue Data Catalog, IAM/Secrets Manager, VPC endpoints, encryption, lifecycle/versioning, and cost/performance best practices (file sizing, compaction)
- Design and optimize AWS Glue jobs using PySpark/DynamicFrames, with bookmarks for incremental loads, dependency packaging, robust error handling, logging/metrics, and unit tests; tune jobs for scale and cost
- Write clean, parameterized, idempotent Airflow DAGs (sensors, SLAs, retries, alerts), manage dependencies across pipelines, and use Git-based CI/CD to promote changes safely
- Build Snowflake ELT models (staging/ODS/marts), tune performance (warehouse sizing, clustering, micro-partitions, caching), use Streams/Tasks/Snowpipe for CDC, and follow solid RBAC and data governance practices
Illustrative sketches of each of these areas follow the requirements list below.
Requirements:
- Strong expertise in Kafka (4-5 years), with hands-on experience designing and operating large-scale, highly available event-streaming platforms, including partitioning strategies, consumer group optimization, schema management, and performance tuning
- Strong hands-on experience pulling data from REST/GraphQL APIs with auth (OAuth2, API keys), pagination, rate limits, retries/backoff, and webhooks
- Strong Python skills to normalize/enrich data and land it cleanly into S3 (schema, partitioning, Parquet)
- Comfortable building/operating S3-based lakes with layered zones (raw → harmonized → conformed → modeled), Glue Data Catalog, IAM/Secrets Manager, VPC endpoints, encryption, lifecycle/versioning, and cost/perf best practices (file sizing, compaction)
- Experience designing and optimizing Glue jobs using PySpark/DynamicFrames, with bookmarks for incremental loads, dependency packaging, robust error handling, logging/metrics, and unit tests
- Ability to tune Glue jobs for scale and cost
- Ability to write clean, parameterized, idempotent DAGs (sensors, SLAs, retries, alerts), manage dependencies across pipelines, and use Git-based CI/CD to promote changes safely
- Experience building ELT models (staging/ODS/marts), tuning performance (warehouse sizing, clustering, micro-partitions, caching), using Streams/Tasks/Snowpipe for CDC, and following solid RBAC and data governance practices
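Illustrative sketches:
For the Kafka responsibility, a minimal sketch of consumer-group processing with manual commits, assuming the confluent-kafka client; the broker address, topic, and group id are placeholders, not values from this posting:
```python
from confluent_kafka import Consumer, KafkaError

def process(payload: bytes) -> None:
    """Placeholder for the normalization/enrichment step."""
    print(payload[:80])

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",   # hypothetical broker
    "group.id": "security-events-etl",     # consumers sharing this id split partitions
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,           # commit manually after processing
})
consumer.subscribe(["security.events"])    # hypothetical topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            # end-of-partition is informational; anything else is a real error
            if msg.error().code() != KafkaError._PARTITION_EOF:
                raise RuntimeError(msg.error())
            continue
        process(msg.value())
        consumer.commit(msg, asynchronous=False)  # at-least-once delivery
finally:
    consumer.close()
```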
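For API-first ingestion into S3, a sketch of a cursor-paginated pull with retries/backoff, landed as date-partitioned Parquet; the endpoint, token, bucket, and field names are all hypothetical, and writing to an s3:// path assumes an S3-capable pyarrow filesystem (e.g. s3fs) is installed:
```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.headers["Authorization"] = "Bearer <oauth2-access-token>"  # placeholder
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=5, backoff_factor=1.0,                  # exponential backoff
    status_forcelist=[429, 500, 502, 503, 504],   # retry on rate limits / 5xx
    allowed_methods=["GET"],
)))

records, cursor = [], None
while True:
    resp = session.get("https://api.example.com/v1/alerts",
                       params={"cursor": cursor, "limit": 500}, timeout=30)
    resp.raise_for_status()
    page = resp.json()
    records.extend(page["data"])
    cursor = page.get("next_cursor")                # cursor pagination
    if not cursor:
        break

for r in records:                                   # light normalization/enrichment
    r["ingest_date"] = datetime.date.today().isoformat()

table = pa.Table.from_pylist(records)
pq.write_to_dataset(table, root_path="s3://example-lake/raw/alerts",
                    partition_cols=["ingest_date"])  # Hive-style partitioning
```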
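For the lake-operations bullet, a sketch of versioning plus lifecycle tiering on a raw zone with boto3; the bucket name, prefixes, and retention windows are illustrative assumptions, not prescribed values:
```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-lake"  # hypothetical bucket

# Versioning protects against accidental overwrites/deletes in the raw zone
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Tier older raw files to cheaper storage and expire stale noncurrent versions
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={"Rules": [{
        "ID": "raw-zone-tiering",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"}],
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    }]},
)
```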
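For Glue + PySpark, a sketch of an incremental job where bookmarks are driven by `transformation_ctx` plus `job.commit()`; the database, table, and S3 paths are placeholders:
```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx makes this source bookmarkable, so reruns read only new data
src = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="alerts", transformation_ctx="alerts_src")

# Example transform: round-trip through a DataFrame to deduplicate
deduped = DynamicFrame.fromDF(src.toDF().dropDuplicates(), glue_context, "deduped")

glue_context.write_dynamic_frame.from_options(
    frame=deduped, connection_type="s3", format="parquet",
    connection_options={"path": "s3://example-lake/harmonized/alerts/",
                        "partitionKeys": ["ingest_date"]},
    transformation_ctx="alerts_sink")

job.commit()  # advances the bookmark only on success
```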
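For Airflow, a sketch of a parameterized, idempotent daily DAG with a sensor, retries, and an SLA, assuming Airflow 2.x and the Amazon provider package; the bucket, keys, and task logic are hypothetical:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def load_partition(ds: str, **_):
    """Rewrite exactly one date partition, so reruns/backfills are idempotent."""
    print(f"(re)loading s3://example-lake/raw/alerts/ingest_date={ds}/")

with DAG(
    dag_id="alerts_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2),          # flag runs that exceed two hours
    },
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="example-lake",
        bucket_key="raw/alerts/ingest_date={{ ds }}/_SUCCESS",  # templated on logical date
        poke_interval=300,
        timeout=60 * 60,
    )
    load = PythonOperator(
        task_id="load_partition",
        python_callable=load_partition,
        op_kwargs={"ds": "{{ ds }}"},       # parameterized by run date
    )
    wait_for_file >> load
```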
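For Snowflake CDC, a sketch of a stream drained by a scheduled task, issued through the Python connector; the account credentials, warehouse, and table names are illustrative only:
```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",  # placeholders
    warehouse="ETL_WH", database="ANALYTICS", schema="STAGING",
)
cur = conn.cursor()

# The stream captures inserts/updates/deletes on the staging table
cur.execute("CREATE STREAM IF NOT EXISTS ALERTS_STREAM ON TABLE STG_ALERTS")

# The task drains the stream into the ODS layer only when new data exists;
# EXCLUDE drops the stream's metadata columns from the projection
cur.execute("""
    CREATE TASK IF NOT EXISTS LOAD_ALERTS_ODS
      WAREHOUSE = ETL_WH
      SCHEDULE  = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('ALERTS_STREAM')
    AS
      INSERT INTO ODS_ALERTS
      SELECT * EXCLUDE (METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID)
      FROM ALERTS_STREAM
""")

cur.execute("ALTER TASK LOAD_ALERTS_ODS RESUME")  # tasks are created suspended
```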