Partner with product, UX, and technical stakeholders to analyze business problems, clarify requirements, define scope, and translate them into measurable ML problem statements.
Design, implement, and maintain scalable, enterprise-grade ML solutions in production.
Build reproducible ML workflows for data preparation, training, evaluation, and inference using modern orchestration and MLOps tooling.
Implement monitoring and evaluation frameworks to continuously improve data quality, model performance, latency, and cost through feedback loops.
Partner cross-functionally with Product, Data Science/ML, Engineering, and Security to deliver resilient, scalable, and compliant ML-powered services.
Demonstrate end-to-end systems understanding and articulate the “why” behind model and system design choices.
Own operational excellence: SLAs, on-call, incident response, customer feedback triage, and blameless post-mortems.
Drive engineering excellence via AI-assisted SDLC, code reviews, automated testing, MLOps best practices, knowledge-sharing, and mentoring.
Actively adopt AI-assisted practices to improve implementation and collaboration efficiency.
Requirements
Strong foundation in ML/AI (statistics, probability, optimization) with the ability to apply these concepts to real-world problems.
5+ years of experience building, deploying, and operating data and ML systems in production.
Proficient in Python, Java, and SQL; strong software engineering fundamentals (system design, testing, version control, code reviews).
Hands-on experience with workflow orchestration and data pipelines (e.g., Airflow, Kubeflow) and cloud data platforms/storage (e.g., SageMaker Feature Store, Snowflake, DynamoDB, OpenSearch).
Experience with the ML lifecycle and MLOps tooling (e.g., MLflow, Metaflow, SageMaker; LLM/agent frameworks such as LangChain/LangGraph; model evaluation/observability tools such as Galileo or similar).
Working knowledge of containerization and cloud infrastructure, including Docker and Kubernetes, GitOps/CI/CD tools (e.g., Argo CD), and at least one major cloud platform (AWS, GCP, or Azure).
Understanding of data modeling and scalable systems, including distributed computing and streaming frameworks (e.g., Spark/EMR, Flink, Kafka Streams); familiarity with GPU-based implementations is a plus.
Demonstrated ability to ramp up quickly and operate effectively in new application/business domains.
Strong written and verbal communication skills: able to document and present designs and decisions, and comfortable giving/receiving feedback in an Agile environment.