Supporting production applications and proactively looking for ways to automate discoveries
Preventing incidents from recurring and/or reducing the time it takes to get customers back up and running
Improving availability, latency, performance, and efficiency, and establishing effective proactive monitoring
Interfacing with business users, development teams and system administrators
Developing, coordinating, and conducting technical reliability studies on engineering designs
Measuring and analyzing the reliability of design, materials, processes, cost, and final products
Recommending design or test methods and statistical process control procedures
Completing risk analysis studies of new designs and processes
Testing and analyzing failures, and proposing changes in design or formulation to improve system and/or process reliability
Requirements
Bachelor's degree, or equivalent work experience
Five to seven years of relevant work experience in business and risk analysis, IT Service Management, production support, product/project management, or application development
Proven experience as a Site Reliability Engineer or in a similar role
Strong knowledge of monitoring tools and incident management
Proficiency in Python or PowerShell
Excellent problem-solving and troubleshooting skills
Strong experience with AWS or Azure services
Experience with Docker and container clustering technologies like AWS ECS or Kubernetes
Experience with monitoring and logging tools such as Datadog, Splunk, Elasticsearch, Kibana, and CloudWatch
Experience using GitLab/GitHub for version control and/or work tracking
Strong communication and collaboration abilities
Financial services industry experience is a plus
Tech Stack
AWS
Azure
Docker
Elasticsearch
ITSM
Kubernetes
Python
Splunk
Benefits
Healthcare (medical, dental, vision)
Basic term and optional term life insurance
Short-term and long-term disability
Pregnancy disability and parental leave
401(k) and employer-funded retirement plan
Paid vacation (from two to five weeks depending on salary grade and tenure)
Up to 11 paid holiday opportunities
Adoption assistance
Sick and Safe Leave accruals of one hour for every 30 hours worked, up to 80 hours per calendar year unless otherwise provided by law