MongoDB is a leading database platform that empowers customers to innovate rapidly. The Site Reliability Engineer (Senior or Staff) for Observability will build and maintain the observability stack, ensuring the smooth functioning of services across the engineering teams while promoting best practices in monitoring and instrumentation.
Responsibilities:
- Define standards and vision for the mission-critical observability platform leveraged by all parts of the engineering organization
- Design, architect, build and deliver core pieces of our observability services in collaboration with other stakeholders
- Design, implement, and troubleshoot monitoring for services that span the globe and run across several cloud providers
- Build for reliability, making services and infrastructure available, resilient, fault tolerant and self-healing
- Identify and configure key metrics to detect incidents and quantify service health, availability and performance
- Participate in a week-long on-call rotation and in the blameless post-mortem process
- Improve our observability capabilities, optimizing for cost, ease of use, and maintainability
Requirements:
- Experience running mission-critical services at scale
- Experience with observability of large-scale distributed systems
- An understanding of information security issues
- Firm grasp of at least one modern programming language, beyond basic scripting
- Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc.)
- Bachelor's degree in Computer Science or equivalent experience
- Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud, Microsoft Azure)
- Experience working in a Kubernetes-based environment