MongoDB empowers customers to innovate rapidly in the market. The company is seeking a Senior Site Reliability Engineer to partner with teams building distributed storage services and to ensure the reliability and operational safety of the storage layer that underpins Atlas.
Responsibilities:
- Work on our multi-tenant distributed storage systems, balancing long-term strategic infrastructure goals with immediate engineering needs
- Build for reliability, making services and infrastructure available, resilient, fault-tolerant, and self-healing
- Identify and configure key metrics to detect incidents and quantify service health, availability, and performance
- Participate in a 24/7 on-call rotation to resolve issues involving the storage infrastructure
- Become an expert in infrastructure performance, helping us optimize from the application level all the way to the kernel
Requirements:
- 6+ years of experience in software development and in operating distributed systems
- Proficiency in Python, Go, or a similar language
- Experience operating or supporting stateful storage or database systems at scale, and comfort with durability, consistency, and recovery trade-offs
- A customer-focused mindset
- A commitment to efficient processes and operations
- A preference for automation over manual processes
- Experience using and extending containerization technologies, particularly Kubernetes, to enhance application agility, optimize resource utilization, and accelerate time-to-market
- Expertise in cloud infrastructure platforms such as AWS, Google Cloud Platform (GCP), or Azure
- Understanding of Linux operating system internals and networking concepts (e.g., TCP/IP, DNS, TLS, routing)
- Experience leading major architectural shifts, such as moving from legacy storage stacks to new multi-tenant storage architectures, including planning and executing large-scale data and workload migrations with tight availability and durability requirements
- Experience managing and scaling infrastructure across multi-cloud environments (AWS, GCP, or Azure)
- Experience designing secure, multi-tenant runtime environments at scale