EXL, a data management company, is seeking a Manager - Databricks Platform Engineer to join the Data Platform team and design, build, and optimize its Databricks ecosystem. The role involves ensuring scalable compute environments and developing new capabilities in machine learning and unified data pipelines.
Responsibilities:
- Design, deploy, and optimize Databricks workspaces and clusters on AWS to support data science and modeling workloads
- Manage Fivetran connectors and dbt projects for end-to-end data ingestion, transformation, and orchestration across Databricks, S3, and Snowflake
- Automate infrastructure provisioning and configuration using Terraform integrated with GitHub Actions
- Architect and implement high availability (HA) and disaster recovery (DR) strategies, including cross-region replication and failover
- Implement job orchestration, cluster policies, autoscaling, and CI/CD workflows for notebooks, models, and releases
- Establish monitoring and alerting for cost, usage, pipeline health, data freshness, and compliance using tools such as Datadog
- Collaborate with platform and security teams to enforce governance, tagging, rightsizing, SLAs, and audit readiness
- Develop and maintain Unity Catalog for governance, lineage, and secure data sharing across platforms
- Build and maintain MLflow integrations for model lifecycle management, experimentation tracking, and deployment
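The cluster-policy, autoscaling, and tagging duties above can be sketched as a small Python snippet that builds a policy payload. The policy keys and rule types follow the Databricks cluster-policy JSON schema; the specific limits, tag names, and the policy name are hypothetical examples, not values from this posting.

```python
import json

# Sketch of a Databricks cluster policy enforcing autoscaling limits,
# an auto-termination ceiling, and mandatory cost-allocation tags.
# Key names and rule types ("range", "fixed", "unlimited") follow the
# cluster-policy JSON schema; the concrete values are illustrative.
cluster_policy = {
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 2},
    "autoscale.max_workers": {"type": "range", "minValue": 2, "maxValue": 10},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "custom_tags.team": {"type": "fixed", "value": "data-platform"},
    "custom_tags.cost_center": {"type": "unlimited", "isOptional": False},
}

def policy_payload(name: str, policy: dict) -> str:
    """Serialize a policy for the Cluster Policies API, which expects the
    definition itself to be a JSON-encoded string inside the request body."""
    return json.dumps({"name": name, "definition": json.dumps(policy)})

payload = policy_payload("ds-standard", cluster_policy)
print(payload)
```

In practice a payload like this would be applied via Terraform's Databricks provider or the Cluster Policies REST API rather than hand-posted, so the same definition can be version-controlled and rolled out through GitHub Actions.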
Requirements:
- Degree in Computer Science, Information Technology, or a related field
- 9-12 years of experience in application development or technical support engineering
- Experience in database administration or data management
- Deep expertise in Databricks administration, cluster management, job scheduling, Unity Catalog, Fivetran, and dbt (Core/Cloud)
- Proficiency in Terraform, GitHub Actions, and Python/PySpark for automation
- Strong knowledge of AWS services, including EKS, ECS, IAM, S3, VPC networking, and CloudWatch
- Experience in pipeline reliability engineering, cost optimization, ML CI/CD, HA/DR design, and observability
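The pipeline-reliability and observability expectations above can be illustrated with a minimal data-freshness check. This is a standard-library sketch: the table names, SLA thresholds, and timestamps are hypothetical, and in a real deployment the last-load times would come from pipeline metadata (e.g. dbt run results or Delta table history) and feed an alerting tool.

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness SLAs per table (hypothetical names/thresholds).
FRESHNESS_SLA = {
    "raw.orders": timedelta(hours=1),
    "analytics.daily_revenue": timedelta(hours=24),
}

def stale_tables(last_loaded: dict, now: datetime) -> list:
    """Return tables whose most recent load exceeds their freshness SLA.
    Tables with no recorded load are treated as stale."""
    epoch = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(
        table
        for table, sla in FRESHNESS_SLA.items()
        if now - last_loaded.get(table, epoch) > sla
    )

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
loads = {
    "raw.orders": now - timedelta(minutes=30),             # within 1h SLA
    "analytics.daily_revenue": now - timedelta(hours=30),  # past 24h SLA
}
print(stale_tables(loads, now))  # ['analytics.daily_revenue']
```

A check like this would typically run on a schedule and emit a metric or alert (e.g. to Datadog) per stale table, rather than printing.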