Zip Co is a digital financial services company offering innovative products. It is seeking a Senior Machine Learning Engineer to build and scale the systems that enable production-grade machine learning and AI. The role focuses on managing the ML lifecycle, collaborating with data science and engineering teams, and solving complex distributed systems problems.
Responsibilities:
- Own and scale the infrastructure that powers production ML and AI across Zip
- Build and maintain batch and streaming feature pipelines
- Design and manage offline and online feature store patterns
- Define MLflow model registry standards and promotion workflows
- Deploy and operate scalable model serving endpoints
- Implement CI/CD for ML pipelines and model deployment
- Develop pipelines using PySpark and Spark SQL
- Optimize joins, partitioning, and shuffle-heavy workloads
- Improve reliability and cost-efficiency of distributed data jobs
- Support streaming workloads using Delta Live Tables
- Manage Databricks clusters, jobs, and access controls
- Improve observability, alerting, and operational standards
- Contribute to Lakehouse architecture (Databricks and Snowflake)
- Implement governance, RBAC, and data quality standards
- Build infrastructure that accelerates experimentation and model deployment
- Support emerging AI use cases, including real-time and large-scale ML systems
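The registry and CI/CD responsibilities above (MLflow promotion workflows, CI/CD for model deployment) ultimately reduce to a promotion gate. A minimal sketch of such gating logic, in plain Python: the metric name (`auc`) and thresholds are hypothetical, and in practice the metrics would be fetched via the MLflow tracking client rather than passed in as dicts.

```python
# Minimal sketch of a model-promotion gate, as might back an MLflow
# registry workflow. Metric names and thresholds are hypothetical;
# real code would fetch metrics through the MLflow tracking API.

def should_promote(candidate: dict, production: dict,
                   min_auc: float = 0.75,
                   max_regression: float = 0.01) -> bool:
    """Promote a candidate model only if it clears an absolute AUC
    floor and does not regress the production model by more than
    max_regression."""
    if candidate["auc"] < min_auc:
        return False  # fails the absolute quality bar
    return candidate["auc"] >= production["auc"] - max_regression

# Candidate slightly better than production -> promote
print(should_promote({"auc": 0.82}, {"auc": 0.81}))  # True
```

A CI/CD pipeline would run a check like this after offline evaluation and, on success, transition the model version to the production stage (or alias) in the registry.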
Requirements:
- 8+ years of experience in Machine Learning with a strong focus on production-grade ML and distributed data systems
- Demonstrated experience owning and operating ML systems end-to-end in production environments
- Advanced experience with PySpark and Spark SQL
- Strong understanding of Spark execution (joins, shuffles, partitioning)
- Experience building and optimizing reliable, scalable data pipelines
- Strong data engineering fundamentals including medallion architecture design, incremental/idempotent ETL patterns, and Delta Lake optimization (partitioning)
- Experience operating ML systems in production
- Hands-on experience with MLflow (tracking + model registry)
- Experience managing feature stores (offline + online)
- Experience deploying and monitoring model serving endpoints
- Experience implementing CI/CD for ML workflows
- Experience working in Azure
- Production experience with Databricks and Delta Lake
- Experience integrating with Azure Cosmos DB or similar NoSQL key-value stores
- Experience designing orchestrated, production-grade data workflows (Databricks Workflows, Airflow, or ADF) with dependency management, backfills, and failure recovery
- Experience with Delta Live Tables and streaming pipelines
- Experience with Iceberg or Lakehouse Federation
- Experience with Snowflake
- Experience with vector databases or LLM infrastructure
- Experience with infrastructure-as-code
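One of the requirements above calls out incremental/idempotent ETL patterns. A minimal sketch of the high-watermark pattern in plain Python, with illustrative table and field names; on Databricks the upsert would be a Delta `MERGE` rather than a dict write.

```python
# Sketch of an incremental, idempotent ETL step: process only rows past
# the stored high watermark, and make the load safe to re-run by
# upserting on a primary key. Names are illustrative.

def incremental_load(source_rows, target: dict, watermark: int) -> int:
    """Upsert rows with event_time > watermark into target (keyed by id).
    Re-running with the same inputs leaves target unchanged (idempotent).
    Returns the new high watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["event_time"] > watermark:
            target[row["id"]] = row  # upsert: insert or overwrite by key
            new_watermark = max(new_watermark, row["event_time"])
    return new_watermark

rows = [{"id": 1, "event_time": 5}, {"id": 2, "event_time": 7}]
tbl = {}
wm = incremental_load(rows, tbl, watermark=0)    # loads both rows, wm == 7
wm2 = incremental_load(rows, tbl, watermark=wm)  # re-run is a no-op, wm2 == 7
```

The key properties are the ones the requirement names: each run advances the watermark monotonically (incremental), and replaying the same batch changes nothing because writes are keyed upserts (idempotent).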