Databricks is a data and AI company that empowers data teams to solve complex problems through its advanced infrastructure platform. In this role, you will develop observability solutions that enhance the health and performance of products and infrastructure, manage extensive cloud infrastructure, and mentor engineers to foster technical excellence.

Responsibilities:

You will build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily
You will manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product
You will develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics
You will leverage powerful capabilities of Databricks’ own data intelligence platform to push the boundaries of troubleshooting practices in the industry
You will uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and oncall rotations
You will mentor and uplevel engineers, fostering a culture of technical excellence within the team and the broader observability community

Requirements:

BS (or higher) in Computer Science or a related field
7+ years of production-level experience in one of Go, Python, Java, Scala, Rust, C++, or a similar language
Experience developing software for large-scale distributed systems
Experience driving large projects involving multiple teams
Experience with cloud technologies, e.g. AWS, Azure, GCP, Docker, or Kubernetes
Familiarity with observability infrastructure, monitoring patterns, and reliability practices

Staff Software Engineer, Observability
