Databricks is a leading data and AI company that empowers data teams to tackle complex challenges. The Sr. Staff Software Engineer in the Observability team will develop advanced observability solutions to monitor product health and performance, manage infrastructure, and enhance incident diagnosis workflows for improved reliability and monitoring practices across the organization.
Responsibilities:
- You will build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily
- You will manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product
- You will develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics
- You will leverage powerful capabilities of Databricks’ own data intelligence platform to push the boundaries of troubleshooting practices in the industry
- You will uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and on-call rotations
- You will mentor and uplevel engineers, fostering a culture of technical excellence within the team and the broader observability community
Requirements:
- BS (or higher) in Computer Science or a related field
- 15+ years of production-level experience in at least one of Go, Python, Java, Scala, Rust, C++, or a similar language
- Experience developing software for large-scale distributed systems
- Experience driving large projects involving multiple teams
- Experience with cloud technologies such as AWS, Azure, GCP, Docker, or Kubernetes
- Familiarity with observability infrastructure, monitoring patterns, and reliability practices