Role Overview

Drive investigations with cross-functional teams to understand failures, analyze production defects, troubleshoot systems, identify root cause, and implement fixes to prevent recurrence
Work with peers to enhance observability, including establishing/maintaining dashboards and monitoring capabilities (e.g., Splunk/New Relic and similar tools), and improving alerting and operational readiness
Ensure high standards of quality, availability, scalability, performance, and security of internally developed applications
Continuously monitor the health and performance of engineering applications, production servers, and key service indicators; provide monitoring/reporting as needed
Support release and operational processes, including troubleshooting CI/CD pipeline issues (e.g., Jenkins pipeline) and coordinating releases as needed with partner teams
Participate in Agile sprints with cross-functional teams (multiple technologies, personnel, and processes) and contribute to continuous delivery and reliability outcomes
Identify opportunities to drive technology innovation, reliability improvements, simplifications, and process improvements
Communicate status of technical stacks, incidents, and reliability initiatives to stakeholders and leadership
Work closely with a blended team of Synchrony resources and third-party partners/contractors
Participate in an on-call rotation to respond to critical production issues

Requirements

Bachelor’s degree and a minimum of 3 years of relevant experience in application development, reliability engineering, systems engineering, and/or production application support (or equivalent practical experience); or, in lieu of a degree, a high school diploma/GED and 5+ years of relevant experience
Good understanding of distributed systems and cloud providers
Solid understanding of cloud concepts such as containerization, message queues, load balancing, data replication, and HA patterns
Understanding of IT application support processes, including incident management, problem resolution, and operational/support metrics used for decision-making
Knowledge of UNIX operating system fundamentals
Familiar with network programming concepts and protocols
Proficiency in DevOps concepts and Site Reliability Engineering (SRE) principles, including automation, monitoring, and reliability best practices
Hands-on experience with scripting/automation in at least one language such as Python, Bash, JavaScript, PowerShell, Go, or similar
Familiar with one or more configuration automation tools such as Terraform, Ansible, Puppet, or Chef
Understanding of the infrastructure of the applications supported
Working knowledge of SDLC and Agile methodologies such as Scrum and Kanban
Strong communication skills (verbal and written) and ability to interact with multiple audiences including developers, managers, and senior executives
Customer-focused mindset; self-driven and detail-oriented; strong organizational and time management skills; able to operate with limited supervision
Well-developed analytical and problem-solving skills

Tech Stack

Ansible
Chef
Cloud
Distributed Systems
JavaScript
Jenkins
Puppet
Python
SDLC
Splunk
Terraform
Unix
Go

Benefits

Flexibility in working arrangements, including occasional work from home
Annual bonus based on individual and company performance

Reliability Engineer
