KData AI is seeking a Lead Databricks ML & AI Ops Engineer to architect, scale, and govern the machine learning and AI platforms that power its enterprise data products. The role involves defining the technical vision for the MLOps lifecycle and leading a team of engineers to deliver high-performance AI features.
Responsibilities:
- Define the architectural roadmap for the enterprise ML platform, steering the migration away from legacy systems (e.g., Apache Airflow) to modern Databricks Workflows and Asset Bundles
- Establish, document, and enforce global engineering standards for code quality, CI/CD pipelines, and automated testing. Lead enterprise-wide ML governance and data security strategies using Unity Catalog
- Lead, mentor, and coach senior and mid-level engineers. Conduct advanced design and code reviews to foster a culture of technical excellence
- Act as the primary technical point of contact for Databricks account teams. Represent the company at external conferences and contribute to tech blogs
- Architect reusable, highly efficient batch and streaming feature pipelines using Delta Live Tables, Spark Structured Streaming, and Databricks Feature Store
- Standardize cluster configurations (including multi-node GPU training) to optimize complex AutoML and deep learning workloads (PyTorch, Hugging Face) for cost and performance
- Design secure, production-grade frameworks for Large Language Model (LLM) operations (LLMOps), including robust RAG pipelines, agentic workflows, and Vector Search indexing using Mosaic AI and LangChain
- Oversee the global setup of MLflow, defining enterprise staging gates, automated retraining loops, and instant rollback strategies
- Architect high-availability, low-latency model serving frameworks using Databricks Model Serving, Mosaic AI Gateway, and containerized deployments on Kubernetes with FastAPI
- Lead the standardization of Infrastructure-as-Code (IaC) using Terraform or Pulumi to automate environment provisioning across multiple cloud regions
- Establish advanced, proactive monitoring for data drift, model performance decay, hallucination rates, and pipeline SLAs using Databricks Lakehouse Monitoring, Prometheus, and Grafana
Requirements:
- 8+ years of software, data, or DevOps engineering experience, including at least 5 years dedicated to production-grade ML/AI systems. Proven experience in a technical lead or architect role
- Expert-level knowledge of the Databricks ecosystem, including Workflows, Delta Live Tables, Unity Catalog, Mosaic AI, and Databricks Asset Bundles
- Advanced understanding of Spark internals (DAG optimization, shuffle tuning, memory management) at enterprise, multi-terabyte scale
- Exceptional Python skills (writing highly optimized, type-annotated, and modular code) alongside deep familiarity with frameworks like PyTorch, scikit-learn, and XGBoost
- Strong background in enterprise GitOps workflows (e.g., trunk-based development), Docker containerization, Kubernetes orchestration, and complex GitHub Actions CI/CD pipelines
- Extensive hands-on experience provisioning and securing cloud infrastructure on AWS, Azure, or GCP
- Databricks Certified Machine Learning Professional and/or Databricks Certified Enterprise Architect certification
- Direct experience implementing Responsible AI frameworks, model cards, bias auditing, and cost-tracking guardrails for LLMs
- Active contributor to open-source ML, MLOps, or data engineering projects (e.g., MLflow, Delta Lake, LangChain)
- Experience designing Lakehouse architectures within a decentralized Data Mesh organizational framework