KData AI is seeking a Lead Databricks ML & AI Ops Engineer to architect, scale, and govern the machine learning and AI platforms that power its enterprise data products. The role involves defining the technical vision for the MLOps lifecycle and leading a team of engineers to deliver high-performance AI features.

Responsibilities:

Define the architectural roadmap for the enterprise ML platform, steering the migration away from legacy systems (e.g., Apache Airflow) to modern Databricks Workflows and Asset Bundles
Establish, document, and enforce global engineering standards for code quality, CI/CD pipelines, and automated testing. Lead enterprise-wide ML governance and data security strategies utilizing Unity Catalog
Lead, mentor, and coach senior and mid-level engineers. Conduct advanced design and code reviews to foster a culture of technical excellence
Act as the primary technical point of contact for Databricks account teams. Represent the company at external conferences and contribute to tech blogs
Architect reusable, highly efficient batch and streaming feature pipelines using Delta Live Tables, Spark Structured Streaming, and Databricks Feature Store
Standardize cluster configurations (including multi-node GPU training) to optimize complex AutoML and deep learning workloads (PyTorch, Hugging Face) for cost and performance
Design secure, production-grade frameworks for Large Language Model (LLM) operations (LLMOps), including robust RAG pipelines, agentic workflows, and Vector Search indexing using Mosaic AI and LangChain
Oversee the global setup of MLflow, defining enterprise staging gates, automated retraining loops, and instant rollback strategies
Architect high-availability, low-latency model serving frameworks utilizing Databricks Model Serving, Mosaic AI Gateway, and containerized deployments via Kubernetes/FastAPI
Lead the standardization of Infrastructure-as-Code (IaC) using Terraform or Pulumi to automate environment provisioning across multiple cloud regions
Establish advanced, proactive monitoring for data drift, model performance decay, hallucination rates, and pipeline SLAs using Databricks Lakehouse Monitoring, Prometheus, and Grafana

Requirements:

8+ years of software, data, or DevOps engineering experience, with at least 5 years dedicated specifically to production-grade ML/AI systems. Proven experience in a technical lead or architectural capacity
Expert-level knowledge of the Databricks ecosystem, including Workflows, Delta Live Tables, Unity Catalog, Mosaic AI, and Databricks Asset Bundles
Advanced understanding of Spark internals (DAG optimization, shuffle tuning, memory management) at an enterprise, multi-terabyte scale
Exceptional Python skills (writing highly optimized, type-annotated, and modular code) alongside deep familiarity with frameworks like PyTorch, scikit-learn, and XGBoost
Strong background in enterprise GitOps workflows (e.g., trunk-based development), Docker containerization, Kubernetes orchestration, and complex GitHub Actions CI/CD pipelines
Extensive hands-on experience provisioning and securing cloud infrastructure on AWS, Azure, or GCP
Databricks Certified Machine Learning Professional and/or Databricks Certified Enterprise Architect
Direct experience implementing Responsible AI frameworks, model cards, bias auditing, and cost-tracking guardrails for LLMs
Active contributor to open-source ML, MLOps, or data engineering projects (e.g., MLflow, Delta Lake, LangChain)
Experience designing Lakehouse architectures within a decentralized Data Mesh organizational framework
