Kyndryl is a company that designs, builds, manages, and modernizes mission-critical technology systems. They are seeking a Site Reliability Engineer (SRE) to ensure reliability and innovation in their information systems, driving continuous improvement and exceptional service to customers.

Responsibilities:

Join us as a Site Reliability Engineer (SRE) and embark on an exciting journey of ensuring reliability, resiliency, and innovation in our information systems and ecosystems
As an SRE at Kyndryl, you'll be at the forefront of driving continuous improvement and delivering exceptional service to our customers
Your role goes beyond traditional engineering, as you'll have the opportunity to analyze business needs, tackle complex problems, and provide strategic advice and designs
You'll be involved in every stage of the software lifecycle, from building and testing to deploying changes and maintaining robust systems
We're looking for a true visionary who can think strategically and help shape the future of our services
Your expertise in building trusted relationships with customers and partnering with them for success will be instrumental in driving our growth
As an SRE, you'll have the unique opportunity to work on end-to-end services, spanning customer sites and platforms
Collaboration and proactivity are key as you work alongside a talented team of professionals, eager to make a difference
You'll embrace an entrepreneurial mindset, taking ownership of your responsibilities and constantly seeking innovative solutions
With an unwavering focus on quality, robustness, and security, you'll be a driving force in implementing cutting-edge tools that enhance our operations, improve reliability, and gather valuable feedback on our platforms
Your ability to identify and mitigate common operational issues will play a crucial role in delivering seamless experiences to our customers

Requirements:

Must have 15+ years of mainframe (MF) modernization experience
10+ years of experience in operational management, including incident management and escalations
Experience with the design and implementation of application monitoring to ensure that reliability and performance meet or exceed business goals
Experience implementing strategies to cap operations load and to handle overflow using appropriate tooling and metrics
Experience defining service level indicators and objectives in collaboration with stakeholders, business, development, DevSecOps, and Operations teams
Solution and design experience in an enterprise environment: on-premises mainframe zSeries, z/OS, DB2, IMS, CICS; job scheduling tools such as Zeke and Jobtrak; job class/weights; COBOL recompile; ABO; OMEGAMON for performance; etc.
BS degree in Computer Science, Engineering, or another highly technical or scientific discipline
Expertise with Ansible, Terraform, and Python
Experience with distributed technologies as well as dynamic resource management frameworks such as Kubernetes
Expertise in leveraging open-source tooling such as Prometheus, Grafana, or Loki

Site Reliability Engineer Mainframe
