MEMX is searching for a Systems Reliability Engineer who will be responsible for providing support for MEMX exchange platforms. The role involves incident response, system outage resolution, and improving operational processes while collaborating with cross-functional teams.
Responsibilities:
- Support MEMX exchange platforms, including on-call coverage, incident response and issue triage
- Help isolate and resolve unplanned system outages
- Work with cross-functional teams, including market operations, systems, networking and development, to support the availability of all MEMX exchange platforms
- Help improve operational processes (such as deployments and upgrades) by identifying areas that need improvement
- Document every action so that findings become repeatable procedures that can eventually be automated
- Debug issues as they arise, across the different services and interaction points
- Enhance monitoring and alerting based on symptoms
- Run nightly processes that are essential to exchange operations. We automate as much as possible, but some processes require a level of manual input and attention
Requirements:
- Good understanding of Linux and comfort working in the Linux shell
- Intermediate to advanced Linux administration and scripting skills
- Proficiency in Bash scripting (Python is nice to have)
- Proficiency in a configuration management tool (Ansible, Chef, Puppet)
- Experience with monitoring tools
- Familiarity with incident tracking / ticketing systems and escalation procedures
- 2 years or more of experience in an operations support role with incident response
- Highly curious, driven and attentive to detail
- Seek out problems to solve in order to make the platform better
- Strong urge to collaborate and improve existing processes
- Drive to deliver quickly and iterate fast
- Trading and/or exchange experience is a plus but not required
- Share our values and work in accordance with those values