Stitch Fix is redefining retail by combining human creativity with advanced data science and Generative AI. The company is seeking a Director of Data & AI/ML Platform Engineering to lead the engineering organization responsible for the enterprise data platform, machine learning platform, and generative AI platform, and to drive product vision and execution across these critical areas.
Responsibilities:
- Data infrastructure at scale. The systems that ingest, store, and make data accessible across the company - petabyte-scale lakehouse, event streaming, workflow orchestration, data governance, and the self-service tools that make this infrastructure usable without platform team involvement at every step
- Machine learning platform. The infrastructure that enables data scientists and engineers to build, experiment, and serve models in production at speed - feature stores, training pipelines, distributed model serving, and the MLOps practices that keep production models healthy, observable, and improving
- Generative AI platform. The platform that enables teams across the company to build, deploy, and govern AI agents and GenAI-powered applications - runtime and routing infrastructure, self-service agent-building tools, context and retrieval management, observability and evaluation frameworks, and the cost and safety controls that keep AI reliable, governed, and improving in production
- The next generation of personalization and decisioning. The foundational platform work behind the company's highest-priority strategic initiatives - partnering with Data Science, Algorithms, and Product to build the next generation of intelligence infrastructure: deeper understanding of clients, products, and style, powered by real-time data, AI reasoning, and systems that continuously improve
- Set and own the product vision for each platform area. Treat internal platforms as products. Understand your users, define north star metrics for platform health and adoption, build a roadmap that earns trust, and communicate the vision in a way that rallies engineers and gains stakeholder buy-in
- Own platform modernization decisions. Lead strategic architectural shifts - open table format migration, feature store re-foundation, model serving modernization, agentic AI infrastructure buildout - on behalf of users and stakeholders. Drive these from problem definition through adoption, not just implementation
- Compress time from idea to production. Build the developer experience, self-service tooling, and golden paths that reduce friction for every type of user - from engineers and data scientists building pipelines and models, to analysts exploring data in BI tools, to business operators building and running AI-assisted workflows. Speed to insight and speed to production are both critical
- Lead and grow the organization. Manage engineering managers and senior ICs across three platform areas. Create clarity, remove blockers, and develop people - while continuously evolving how the team works, applying the AI capabilities you build to accelerate your own org's velocity and shaping the skills and structure the team needs for an AI-first engineering model
- Drive cross-functional alignment. Partner with Data Science, ML Engineering, Data Engineering, Product, and Business leaders to align platform investment with business priorities. Represent the platform in quarterly planning, architecture reviews, and executive forums
- Communicate with authority at every level. Write crisp strategy documents. Present platform trade-offs to the C-suite. Sit with an engineer and whiteboard a system design. Fluency across these modes is a requirement, not a nice-to-have
- Run the business. Own budget, headcount planning, vendor relationships, contractor management, and the long-horizon platform strategy. Balance investment in new capabilities with operational excellence and the reduction of legacy systems
Requirements:
- 10+ years in software, data, or ML/AI platform engineering; 5+ years leading engineering managers or multi-team platform organizations
- Track record of owning and evolving production-grade platform systems at scale - not just building them, but driving adoption, rationalizing legacy systems, and measurably improving developer and data science productivity over time
- History of making and landing consequential architectural decisions in complex, high-availability environments; comfort with the full lifecycle from design through post-launch iteration
- Hands-on experience with distributed compute and storage (Spark, Trino/Presto, Apache Iceberg or Delta Lake), event streaming (Kafka, Flink), workflow orchestration (Airflow), and data governance and quality systems
- Experience with feature engineering and feature stores, model training pipelines, model deployment and serving (Ray Serve, Triton, or equivalent), monitoring and validation, and the operational practices of running ML in production (MLOps)
- Experience with LLM orchestration frameworks, retrieval-augmented generation (RAG), agent architectures, evaluation frameworks, cost and latency governance, and the emerging standards around agentic AI (Model Context Protocol or equivalent)
- Experience building internal developer platforms (IDPs), self-service tooling, and platform abstractions that reduce friction for engineering teams; familiarity with developer experience metrics and platform adoption patterns
- Experience with distributed systems design, container orchestration (Kubernetes), and cloud infrastructure at scale (AWS preferred)
- Product-led mindset. You approach internal platforms the same way a strong product leader approaches external products: segmented user personas, defined success metrics, a prioritized roadmap, and a bias toward adoption and impact over feature completeness
- 360-degree execution. You own the full loop - discovery and planning, iterative delivery, production quality, user enablement and evangelism, and the feedback loops that close on real-world impact
- Strategic communication and influence. You can make a compelling case for a multi-year platform investment to a CxO, write a technical design doc your engineers will actually follow, and give a data scientist a useful answer about why their job is slower than it should be. Each of these is a different skill; you have all three
- You represent users' needs inside the platform team. You hold the bar on developer experience, self-service reliability, and documentation quality. You treat user complaints as signal, not noise