Stripe, a financial infrastructure platform for businesses, is seeking a Staff Software Engineer to help define and deliver the next generation of its Kafka-first streaming infrastructure. The role involves driving innovation to meet high availability targets and solving complex problems that arise from operating Kafka in production.
Responsibilities:
- Design, build, and operate event-driven infrastructure with Apache Kafka at the center, alongside technologies like Temporal and AWS services
- Partner with product and platform teams across Stripe to understand requirements, unblock Kafka adoption, and improve how streaming infrastructure is used end-to-end
- Define and implement operational best practices (e.g., shuffle sharding, cellular architecture, load shedding, automated failover) to improve resilience and reliability at scale
- Drive fleet-level automation and standardization (“pets” to “cattle”) through self-service workflows, safer rollouts, and self-healing systems that reduce manual operations
- Lead initiatives that raise the bar on Kafka availability and durability (e.g., multi-region strategies, disaster recovery readiness, operational readiness reviews, incident learning)
- Evaluate and productionize Kafka ecosystem capabilities (e.g., tiered storage, direct-to-S3) to improve cost-efficiency and scalability without compromising reliability
- Here are some examples of recent work the team has done: 6 Nines and Tiered Storage in Production?
Requirements:
- This is a Staff-level role, which typically means 10+ years of experience building, operating, and evolving large-scale production systems
- Experience as a technical lead for team(s) working on distributed systems, including scaling them in fast-moving environments
- Hands-on experience with big data technologies such as Kafka, Pulsar, Flink, or Pinot
- Comfortable operating with high autonomy and ownership
- Growth mindset and a willingness to learn quickly, explore ambiguous problem spaces, and dive deep when needed
- Strong written and verbal communication skills, including the ability to produce clear technical documentation
- Experience operating streaming technologies as a platform (e.g., Kafka, Pulsar, Flink, Pinot) for internal customers at scale
- Experience building or operating control planes for managing large-scale infrastructure