MongoDB is a leading multi-cloud database-as-a-service company that operates MongoDB Atlas. The Cloud Operations Engineer will be responsible for ensuring the operational success of Atlas customers by monitoring systems, troubleshooting incidents, and collaborating with a global team to enhance processes and tools.
Responsibilities:
- Coordinate and collaborate with a global team of Cloud Operations Engineers responsible for meeting our uptime guarantees to Atlas customers
- Help scale the worldwide Cloud Operations Engineering team with the strategic implementation and refinement of new processes and tools
- Assist in scoping, designing and deploying systems that reduce Mean Time to Resolve for customer incidents
- Monitor and detect emerging customer-facing incidents on the Atlas platform; assist in their proactive resolution
- Automate routine monitoring and troubleshooting tasks
- Diagnose live incidents, differentiate between platform issues and usage issues, and take the next steps toward resolution
- Assist in performing root cause analysis after incident recovery, identifying any breakdowns in processes or workflows that contributed to the event and the changes needed to prevent similar events
- Contribute to documentation of corner case scenarios, troubleshooting workflows and SOPs
- Work alongside our product management, cloud engineering and support organizations by identifying areas for improvement in the management applications powering the Atlas infrastructure
- Inform executive leadership and escalation management personnel of major outages
- Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (identified proactively through automated monitoring or reactively through alerts from our Technical Services team)
Requirements:
- At least 2 years of experience as an on-call DevOps, SRE, or Cloud Operations engineer
- Expertise with Linux system administration, configuration, and troubleshooting
- Experience in monitoring, system performance data collection and analysis, and reporting
- Knowledge of database operations and concepts
- Expertise with networking technologies such as DNS and TCP/IP
- Familiarity with Amazon Web Services and other cloud infrastructure platforms (e.g. GCP, Azure)
- Knowledge of a wide range of web and internet technologies
- Ability to write small programs/scripts to solve short-term systems problems
- A CS/CE degree or equivalent experience
- Proficiency in at least one of the following programming languages: Java, Go, Python, JavaScript
- A keen interest in learning new things
- Must be a US citizen
- MongoDB
- Splunk
- Kubernetes