Jack Henry & Associates is a technology company focused on transforming financial services for community banks and credit unions. They are seeking a Senior Site Reliability Engineer to support infrastructure growth in public and private cloud spaces while implementing Site Reliability Engineering practices to enhance system reliability and performance.
Responsibilities:
- Drive the reliability and performance of both public cloud (production, testing, and development) and internal server infrastructure environments
- Design and implement robust Site Reliability Engineering practices, including defining and monitoring Service Level Objectives (SLOs) and Service Level Indicators (SLIs), focusing on proactive system health and error budgets
- Ruthlessly eliminate manual, repetitive work (toil) through automation. Develop and maintain automation scripts and tooling to streamline operations across the hybrid datacenter model (on-premises and public cloud)
- Treat the cloud and on-prem operational environment as a software project by using Infrastructure as Code (IaC) with tools like Terraform, Ansible, and GitHub for provisioning and configuration
- Design and maintain rigorous configuration management processes to guarantee the consistency and desired state of the hybrid datacenter infrastructure, leveraging tools like Ansible
- Establish and manage comprehensive monitoring and alerting systems to provide deep visibility into the health and performance of services. Build systems that are self-healing and advocate for themselves
- Lead blameless post-mortems and root cause analyses (RCAs) for critical incidents, focusing on system-level improvements to prevent recurrence and enhance overall reliability
- Develop and implement strategies for efficient patch and vulnerability management across all environments. Automate security remediation efforts to ensure timely vulnerability mitigation and compliance (e.g., CIS, NIST, PCI)
- Support the company's strategic growth into public cloud services (GCP, Azure) and play a key role in the migration and redesign of services from on-premises data centers to GCP, ensuring adherence to SRE principles throughout the transition
- Partner closely with DevOps and development teams to embed reliability best practices throughout the software development lifecycle, ensuring seamless integration and operation of hybrid datacenter services
- Maintain comprehensive and actionable documentation for SRE processes, operational runbooks, and configurations
- May perform other duties as assigned
Requirements:
- Minimum 6 years of experience in cloud and hybrid datacenter operations with a focus on Infrastructure as Code (IaC) and Site Reliability Engineering
- Proficiency with GCP (preferred), AWS, and/or Azure
- Proficient in using Terraform and Ansible in a CI/CD (continuous integration and continuous delivery) pipeline
- Experience using PowerShell, Python, or Go
- Solid understanding of Linux (POSIX) and Windows system administration, as well as networking and firewalls
- Understanding of security best practices and compliance standards such as CIS, NIST and PCI
- Ability to participate in an on-call rotation every 7-8 weeks
- Bachelor's degree in Computer Science, Information Technology, or Engineering
- Relevant industry certifications. Google Associate Cloud Engineer or Google Cloud Architect preferred
- Proficient in ArgoCD and GitOps
- Familiarity with SQL and NoSQL databases
- Experience with OpenTelemetry tooling and with monitoring and alerting tools such as Prometheus, Grafana, and the ELK Stack
- Experience with Site Reliability Engineering (SRE) principles, including but not limited to Service Level Objectives (SLOs) and Service Level Indicators (SLIs), toil reduction, automation, and root cause analysis