Build and own large-scale data pipelines and observability systems that power metrics, logging, and real-time insights across services.
Focus on designing reliable telemetry pipelines, improving monitoring and alerting, and ensuring data quality and system visibility at scale.
Own the compute infrastructure that powers large-scale Spark workloads.
Focus on optimizing core Spark performance, solving distributed systems challenges, and building scalable AI infrastructure, including exploring efficient ways to run smaller language models.
Requirements
Degree in a related technical field required
Strong understanding of distributed systems design, including scalability, fault tolerance, and consistency trade-offs in large-scale data platforms
7+ years of backend software development experience building large-scale distributed systems
Strong programming skills: Java strongly preferred; Python or Rust a plus
Experience designing and operating large-scale data pipelines, ETL workflows, or streaming data systems
Experience with big data and data platform technologies such as Spark, Flink, Kafka, Trino, HBase, or similar
Strong experience with public cloud platforms, especially AWS or GCP
Strong experience with Kubernetes and container orchestration
Experience operating data platforms or infrastructure services at enterprise scale
Experience owning and operating multiple instances of mission-critical services
Experience building or operating observability systems, telemetry pipelines, or monitoring platforms
Experience using metrics, logging, and telemetry to drive operational excellence
Experience with Agile development and Test-Driven Development