Stripe is a financial infrastructure platform for businesses, aiming to increase the GDP of the internet. The Staff Software Engineer in Stream Compute will define and deliver the next generation of Stripe's Flink-first stream compute infrastructure, focusing on high availability and reliable operations at global scale.
Responsibilities:
- Design, build, and operate stream compute infrastructure with Apache Flink at the center, alongside technologies like Kafka, Temporal, and AWS services
- Partner with product and platform teams across Stripe to understand requirements, unblock Flink adoption, and improve how stream processing infrastructure is used end-to-end
- Define and implement operational best practices (e.g., shuffle sharding, cellular architecture, load shedding, automated state recovery) to improve resilience and reliability at scale
- Drive fleet-level automation and standardization ("pets" to "cattle") through self-service workflows, safer rollouts, and self-healing systems that reduce manual operations
- Lead initiatives that raise the bar on Flink availability and state durability (e.g., multi-region strategies, disaster recovery readiness, operational readiness reviews, incident learning)
- Evaluate and productionize Flink ecosystem capabilities (e.g., SQL, connectors, state backends) to improve developer experience and scalability without compromising reliability
- Work closely with the open source community to identify opportunities to adopt new open source features and to contribute back to OSS
Requirements:
- This is a Staff-level role, which typically means 10+ years of experience building, operating, and evolving large-scale production systems
- Experience as a technical lead for one or more teams working on distributed systems, including scaling them in fast-moving environments
- Hands-on experience with big data technologies such as Flink, Spark, Kafka, Pulsar, or Pinot
- Experience developing, maintaining, and debugging distributed systems built with open source tools
- Experience building and scaling infrastructure as a product
- Strong software engineering skills and a passion for big data distributed systems
- Ability to write high-quality code in programming languages such as Go, Java, or Scala
- Comfortable operating with high autonomy and ownership
- Growth mindset and a willingness to learn quickly, explore ambiguous problem spaces, and dive deep when needed
- Strong written and verbal communication skills, including the ability to produce clear technical documentation
- Experience operating streaming infrastructure as a platform (e.g., Flink clusters, Kafka, Pulsar) for internal customers at scale
- Deep hands-on experience developing, optimizing, and operating real-time processing frameworks such as Flink, Spark Streaming, Storm, or Kafka Streams in production
- Experience building or operating control planes for managing large-scale infrastructure
- Open source contributions to data processing or big data systems (e.g., Hadoop, Spark, Celeborn, Flink)