Kyndryl is a company that designs, builds, manages, and modernizes mission-critical technology systems. They are seeking a Site Reliability Engineer (SRE) to ensure reliability and innovation in their information systems, driving continuous improvement and exceptional service to customers.
Responsibilities:
- Join us as a Site Reliability Engineer (SRE) and embark on an exciting journey of ensuring reliability, resiliency, and innovation in our information systems and ecosystems
- As an SRE at Kyndryl, you'll be at the forefront of driving continuous improvement and delivering exceptional service to our customers
- Your role goes beyond traditional engineering, as you'll have the opportunity to analyze business needs, tackle complex problems, and provide strategic advice and designs
- You'll be involved in every stage of the software lifecycle, from building and testing to deploying changes and maintaining robust systems
- We're looking for a true visionary who can think strategically and help shape the future of our services
- Your expertise in building trusted relationships with customers and partnering with them for success will be instrumental in driving our growth
- As an SRE, you'll have the unique opportunity to work on end-to-end services, spanning customer sites and platforms
- Collaboration and proactivity are key as you work alongside a talented team of professionals, eager to make a difference
- You'll embrace an entrepreneurial mindset, taking ownership of your responsibilities and constantly seeking innovative solutions
- With an unwavering focus on quality, robustness, and security, you'll be a driving force in implementing cutting-edge tools that enhance our operations, improve reliability, and gather valuable feedback on our platforms
- Your ability to identify and mitigate common operational issues will play a crucial role in delivering seamless experiences to our customers
Requirements:
- 15+ years of mainframe (MF) modernization experience
- 10+ years of experience in operational management, including incident management and escalations
- Experience designing and implementing application monitoring to ensure reliability and performance meet or exceed business goals
- Experience implementing strategies to cap operational load and handle overflow using appropriate tooling and metrics, and defining service level indicators and objectives in collaboration with stakeholders and with business, development, DevSecOps, and operations teams
- Solution and design experience in an enterprise environment: on-premises mainframe (zSeries, z/OS, DB2, IMS, CICS), job scheduling tools such as Zeke and Jobtrac, job class/weights, COBOL recompilation, ABO, OMEGAMON for performance monitoring, etc.
- BS degree in Computer Science, Engineering, or another highly technical or scientific discipline
- Expertise with Ansible, Terraform, and Python
- Experience with distributed technologies as well as dynamic resource management frameworks such as Kubernetes
- Expertise in leveraging open-source tooling such as Prometheus, Grafana, or Loki