Design, develop, and oversee the strategy and architecture of scalable and reliable AI/ML Ops platforms and pipelines
Model Deployment: Package and deploy AI/ML services to production, ensuring they are reproducible and interpretable
CI/CD Pipeline Development: Design and implement automated CI/CD (Continuous Integration/Continuous Deployment) pipelines to accelerate model deployment
Infrastructure Management: Provision and optimize infrastructure for training and serving, utilizing Docker, Kubernetes, or serverless platforms
Monitoring & Observability: Implement post-deployment monitoring for model performance, data drift, and latency. Experience with Monte Carlo is preferable
Automation: Automate retraining and data pipeline workflows to ensure models stay accurate over time.
Manage the deployment of foundation models, fine-tuning workflows, and Retrieval-Augmented Generation (RAG) stacks (vector databases, knowledge graphs). Experience with AWS Bedrock is preferable
Resource Optimization: Manage GPU/CPU utilization to minimize cloud costs while maintaining low-latency inference for users
Collaboration: Work closely with data scientists, data engineers, and software engineers to bridge the gap between model development and production.
Version Control & Governance: Manage versioning for data, code, and models using tools like MLflow.
Security & Compliance: Implement data security measures, ensure compliance with data governance policies, and protect sensitive data
Technology Evaluation and Innovation: Stay abreast of emerging data technologies and explore opportunities for innovation to improve the organisation’s data infrastructure
Troubleshooting and Problem Solving: Diagnose and resolve complex data-related issues, ensuring the stability and reliability of the data platform
Perform other duties as assigned
Requirements
Experience with enterprise SaaS software solutions with high availability and scalability
Experience with solutions handling large-scale structured and unstructured data from varied data sources
Experience in building and maintaining AI/ML Ops platform systems ensuring scalability, reliability, efficiency and security
Experience working with the product engineering team to influence designs with data, AI, and analytics use cases in mind
In-depth experience in system design and in AI/ML frameworks and tools handling petabyte-scale data within the Databricks Lakehouse ecosystem
AI/MLOps workflows on Databricks, MLflow, Mosaic AI Agent Framework, Unity Catalog, Vector Search, Knowledge Graph
Knowledge of AI/ML frameworks such as LangChain and LangGraph for AI/ML Ops pipeline integration
Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, or GCP). Experience with an AWS-hosted data platform is preferable
Proficiency in programming languages such as Python and SQL
Modern software engineering practices and tooling such as Kubernetes, CI/CD, IaC tools (preferably Terraform), observability, monitoring, and alerting
Solution cost optimisation and design-to-cost
Legally eligible to work in India on an ongoing basis.