Kadence is seeking a Senior/Staff MLOps / ML Platform Engineer to build and scale the ML infrastructure that powers real-time decision models. This role focuses on the platform, tooling, and operations that let data scientists move from notebook to production safely and repeatably.
Responsibilities:
- Design, build, and maintain core ML platform components used by the Data Science team
- Implement and own feature pipelines and feature management (e.g., feature store or equivalent), including batch and/or streaming ingestion, transformation, and serving
- Build or integrate experimentation and model tracking tools to manage datasets, configurations, model versions, and metrics
- Implement robust model monitoring in production (performance, drift, data quality, alerting) and feed findings back into the modeling lifecycle
- Partner with data scientists to understand their workflows and translate them into reliable services, libraries, and automation
- Define and enforce best practices for ML operations: testing, deployment, observability, rollback, reproducibility
- Evaluate and integrate third‑party or open‑source MLOps tools where they make sense; build bespoke components when needed
- Identify and lead initiatives that materially improve reliability, scalability, and velocity of ML development and deployment
Requirements:
- Significant experience as an MLOps / ML Platform Engineer, Machine Learning Engineer, or Software Engineer building ML‑adjacent infrastructure
- Demonstrated experience building ML platforms or major MLOps components from scratch or near‑scratch, not just maintaining existing systems
- Strong programming skills in a production language (e.g., Python, Go, or similar), with solid software engineering fundamentals (testing, code review, CI/CD)
- Experience designing and operating feature pipelines and feature management solutions (custom or tools like Feast, Tecton, etc.)
- Hands‑on experience setting up model monitoring in production (e.g., tracking performance, drift, and data quality; alerting and remediation workflows)
- Experience operating services in a modern cloud environment (AWS/GCP/Azure), including containerization (Docker) and orchestration (Kubernetes or similar)
- Background in a data‑first product company where ML is core to the business
- Experience collaborating closely with data science teams, with enough ML understanding to speak their language while remaining primarily an infrastructure/ops engineer
- Nice‑to‑have: experience with model deployment systems (batch and/or online); this is not the main focus of the role
- Prior experience in a startup or “scrappy” environment where you owned ambiguous, greenfield platform problems