Jack Henry & Associates is a technology company focused on transforming financial services for community banks and credit unions. They are seeking a Senior Site Reliability Engineer to support infrastructure growth in public and private cloud spaces while implementing Site Reliability Engineering practices to enhance system reliability and performance.

Responsibilities:

Drive the reliability and performance of both public cloud (production, testing, and development) and internal server infrastructure environments
Design and implement robust Site Reliability Engineering practices, including defining and monitoring Service Level Objectives (SLOs) and Service Level Indicators (SLIs), focusing on proactive system health and error budgets
Ruthlessly eliminate manual, repetitive work (toil) through automation. Develop and maintain automation scripts and tooling to streamline operations across the hybrid datacenter model (on-premises and public cloud)
Treat the cloud and on-prem operational environment as a software project by using Infrastructure as Code (IaC) with tools like Terraform, Ansible, and GitHub for provisioning and configuration
Design and maintain rigorous configuration management processes to guarantee the consistency and desired state of the hybrid datacenter infrastructure, leveraging tools like Ansible
Establish and manage comprehensive monitoring and alerting systems to provide deep visibility into the health and performance of services. Build systems that are self-healing and advocate for themselves
Lead blameless post-mortems and root cause analyses (RCAs) for critical incidents, focusing on system-level improvements to prevent recurrence and enhance overall reliability
Develop and implement strategies for efficient patch and vulnerability management across all environments. Automate security remediation efforts to ensure timely vulnerability mitigation and compliance (e.g., CIS, NIST, PCI)
Support the company's strategic growth into public cloud services (GCP, Azure) and play a key role in the migration and redesign of services from on-premises data centers to GCP, ensuring adherence to SRE principles throughout the transition
Partner closely with DevOps and development teams to embed reliability best practices throughout the software development lifecycle, ensuring seamless integration and operation of hybrid datacenter services
Maintain comprehensive and actionable documentation for SRE processes, operational runbooks, and configurations
May perform other duties as assigned

Requirements:

Minimum 6 years of experience in cloud and hybrid datacenter operations with a focus on Infrastructure as Code (IaC) and Site Reliability Engineering
Proficiency with GCP (preferred), AWS, and/or Azure
Proficient in using Terraform and Ansible in a CI/CD (continuous integration and continuous delivery) pipeline
Experience using PowerShell, Python, or Go
Solid understanding of Linux (POSIX) and Windows system administration, as well as networking and firewalls
Understanding of security best practices and compliance standards such as CIS, NIST and PCI
Ability to participate in an on-call rotation every 7-8 weeks
Bachelor's degree in Computer Science, Information Technology, or Engineering
Relevant industry certifications. Google Associate Cloud Engineer or Google Cloud Architect preferred
Proficient in ArgoCD and GitOps
Familiarity with SQL and NoSQL databases
Experience with OpenTelemetry tooling and with monitoring and alerting tools such as Prometheus, Grafana, and the ELK Stack
Experience with Site Reliability Engineering (SRE) principles, including but not limited to Service Level Objectives (SLOs) and Service Level Indicators (SLIs), toil reduction, automation, and root cause analysis

Senior Software Engineer: Site Reliability (SRE)