Databricks is a data and AI company that empowers data teams to solve complex problems through its advanced infrastructure platform. In this role, you will develop observability solutions that enhance the health and performance of products and infrastructure, manage extensive cloud infrastructure, and mentor engineers to foster technical excellence.
Responsibilities:
- You will build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily
- You will manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product
- You will develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics
- You will leverage powerful capabilities of Databricks’ own data intelligence platform to push the boundaries of troubleshooting practices in the industry
- You will uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and oncall rotations
- You will mentor and uplevel engineers, fostering a culture of technical excellence within the team and the broader observability community
Requirements:
- BS (or higher) in Computer Science or a related field
- 7+ years of production-level experience in at least one of Go, Python, Java, Scala, Rust, C++, or a similar language
- Experience developing software for large-scale distributed systems
- Experience driving large projects involving multiple teams
- Experience with cloud technologies such as AWS, Azure, GCP, Docker, or Kubernetes
- Familiarity with observability infrastructure, monitoring patterns, and reliability practices