Quantum World Technologies Inc. is seeking a Site Reliability Engineer to manage and improve service reliability. The role involves incident management, software development for automation, and ensuring service performance against established objectives.
Responsibilities:
- Incident Management: Create and manage the necessary processes for handling incidents
- Partner with Ops Control to ensure IT and/or End User communications are handled appropriately
- Engage with the development team throughout the life cycle to support building applications for reliability
- Develop software to automate manual operational work
- Run, maintain and improve the service against established Service Level Objectives by applying software engineering principles
- Take responsibility for the availability, performance, change (CP) management, monitoring, and capacity management of the services
- Troubleshoot priority incidents, conduct blameless post-mortems, and ensure permanent closure of incidents
- Analyze patterns of production incidents, develop permanent remediation plans, and implement automation to prevent future incidents from occurring through software engineering
- Manage process-related functions around large-scale events such as disaster recovery, and communicate closely with impacted groups to ensure all events are properly managed
Requirements:
- Site Reliability Engineer (SRE) role in which 80% will be support [React/Protect] and 10% will be in the DevOps [Enable] space
- Proven track record supporting large-scale, multi-tiered, cloud-based applications
- Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns
- Hands-on experience with Java, Angular, Spring, DB2, and Unix scripting, plus experience with scheduler tools such as TWS and AutoSys
- L2-L3 production support experience, with debugging and problem-solving skills
- Experience working in an Agile Development environment
- Proven ability to understand and troubleshoot complex problems under pressure
- Excellent written and oral communication, listening, influencing, and negotiation skills
- Experience with performance troubleshooting and remediation
- Experience with observability tools such as Splunk, Kibana, Grafana, and Prometheus
- Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices