Role Overview

Oversee a team of Site Reliability Engineers responsible for identifying automation opportunities and implementing tools and processes that streamline routine tasks, enable scalable infrastructure, and support seamless deployments.
Lead improvement of the reliability and availability of critical applications, platforms, and server infrastructure through proactive monitoring, incident management, and resiliency improvements.
Guide the team to develop and track new service level indicators to support SLO and SLA compliance.
Evaluate and interpret monitoring and alerting solutions that improve visibility into infrastructure, application performance, and user experience.
Formulate and execute strategic initiatives to enhance efficiency, including capacity planning, disaster recovery, and business continuity measures.
Recommend and implement improvements to disaster recovery plans, backup strategies, and failover mechanisms.
Ensure ongoing compliance with industry regulations, standards, and best practices, particularly in data security and privacy.
Maintain up-to-date knowledge of emerging technologies and trends in Site Reliability Engineering, SaaS platform server management, and fintech to drive continuous innovation within the team.
Supervise maintenance, configuration, and reliability of all data center infrastructure, including servers, networks, and storage systems.
Deliver a production server operations environment that meets all service level agreements, processing service level objectives, response time targets, and availability targets.
Oversee data security protocols and maintain adherence to regulatory and industry standards.
Lead incident management processes, ensuring rapid resolution and clear communication with stakeholders.
Identify and drive improvements in reliability, performance, and efficiency through data and root cause analysis.
Participate in an on-call rotation to support critical production incidents.
Strategically manage capacity to support future growth, ensuring the data center adapts to increasing demands without compromising security or performance.
Partner with cross-functional teams to align data center operations with overall organizational objectives.
Partner with development, QA, DevOps, and product teams to influence design and drive application resiliency improvements.
Proactively identify operational risks and develop strategies to mitigate disruptions or data breaches.
Conduct regular service level reviews to evaluate platform and application performance, and manage a structured feedback loop to identify, track, and resolve recurring technology and application issues.

Requirements

Extensive experience managing mission-critical platforms, applications, and services, including at least 5 years in a leadership capacity.
7-10+ years of management experience across the software development life cycle.
Possesses solid technical knowledge or at least a fundamental grasp of the key principles related to the technologies mentioned below:
Mainframe Technologies: COBOL and RPG; preferred: JCL, CICS, SQL, CL, DDS, DDL, JES, and mainframe environments (AS/400, z/OS), or a willingness to learn.
Modern Languages & Frameworks (Required): Java, C#, Python, JavaScript, Spring Boot, Hibernate, JDBC, Angular, Oracle PL/SQL.
Automation & IaC (Required): Python/Bash/PowerShell scripting, Terraform, Ansible, Jenkins, GitHub, Bitbucket, ServiceNow, Jira, Azure DevOps.
Monitoring Tools (Preferred): Splunk, Dynatrace, Resolve, Nobl9, JMeter, Zabbix.
Experience working with Windows, Linux, and IBM i operating systems, and administering applications on them.
Comprehensive knowledge of data center architecture and infrastructure components including server topology, networking, storage, and virtualization technologies.
Proficient in cybersecurity practices and data protection protocols relevant to data center environments.
Demonstrated ability to lead and motivate teams, coupled with strong communication and interpersonal capabilities.
Exceptional analytical skills and a commitment to continuous improvement.
Familiarity with SDLC, CI/CD, as well as DevOps and Site Reliability methodologies.
Resourceful and proactive in gathering information, resolving challenges, and promoting innovative solutions.
Excellent strategic thinking and innovation, supported by advanced problem-solving and analytical abilities.
Effective incident and problem management, including oversight and implementation of permanent solutions.
Outstanding communication skills and the ability to collaborate effectively with both technical and business stakeholders.
Well-versed in industry regulations and compliance standards pertinent to data center operations.
Bachelor’s degree in Computer Science, Information Technology, or a related discipline is required; a Master’s degree is preferred.

Tech Stack

Angular
Ansible
Azure
Cyber Security
Hibernate
Java
JavaScript
Jenkins
JMeter
Linux
Oracle
Python
SDLC
ServiceNow
Splunk
Spring
Spring Boot
SQL
Terraform

Benefits

A career at FIS is more than just a job. It’s the chance to shape the future of fintech.
Always-on learning and development
Collaborative work environment
Opportunities to give back
Competitive salary and benefits

Site Reliability Engineering Manager – Software Engineering, Mainframe
