Stitch Fix, Inc. is the leading online personal styling service that helps people discover styles they will love. They are seeking a Staff Software Engineer to provide technical leadership for their Order Management domain, focusing on architecture, reliability, and long-term evolution of systems that power checkout and fulfillment orchestration.
Responsibilities:
- Own and evolve the technical direction of the Order Management domain, including checkout, order state, fulfillment coordination, and third-party integrations
- Design and implement resilient distributed systems with clear failure modes, graceful degradation, safe rollout and rollback strategies, and strong observability
- Establish and standardize architectural patterns for partner integrations, including timeouts, retries, circuit breaking, fallbacks, idempotency, reconciliation, and consistency guarantees
- Lead domain-wide initiatives from problem framing through production rollout and long-term hardening, ensuring solutions are not only shipped but remain reliable and maintainable
- Drive high-severity and cross-service incident response, including communication, technical decision-making, root-cause analysis, and systemic remediation
- Define and uphold domain standards for testing, release safety, on-call readiness, runbooks, SLIs/SLOs, and operational excellence
- Produce clear technical designs, RFCs, and decision records that create alignment and a durable paper trail for future teams
- Partner with Product and Engineering leadership on roadmap planning, sequencing, and investment tradeoffs, framing technical decisions in terms of customer impact, risk, and business outcomes
- Mentor and coach senior and mid-level engineers, raising the quality bar for system design, operational thinking, and technical decision-making across the organization
Requirements:
- 8+ years of experience building and operating large-scale backend or distributed systems
- Strong software engineering experience across one or more modern programming languages; our current stack includes Golang, Ruby on Rails, PostgreSQL, AWS, Kafka, and Temporal, though prior experience with these specific tools is not required
- Strong production experience in cloud-hosted systems (e.g., GCP, AWS, Azure), with the ability to reason about capacity, failure scenarios, and operational behavior beyond just the application layer
- Hands-on experience with relational databases such as PostgreSQL, with solid understanding of data modeling, transactional semantics, performance tuning, and operational concerns (backups, failover, migrations)
- Proven ability to design and operate business-critical, stateful systems with high availability and strict correctness requirements
- Experience working with third-party systems and designing for partial failure, latency, data consistency, and recovery
- Track record of leading architectural initiatives that span multiple teams and persist over multiple quarters
- Strong written and verbal communication skills, with the ability to clearly articulate tradeoffs, document decisions, and align diverse stakeholders
- Ability to hold strong technical opinions while remaining open to better ideas, data, and alternative perspectives
- Experience in e-commerce, order management, payments, inventory, logistics, or other high-throughput, transactional domains