Kyndryl is a company that designs, builds, manages, and modernizes mission-critical technology systems. They are seeking a Site Reliability Engineer to ensure reliability and innovation in their information systems while driving continuous improvement and delivering exceptional service to customers.
Responsibilities:
- Ensure reliability, resiliency, and innovation in our information systems and ecosystems
- Analyze business needs, tackle complex problems, and provide strategic advice and designs
- Be involved in every stage of the software lifecycle, from building and testing to deploying changes and maintaining robust systems
- Build trusted relationships with customers and partner with them for success
- Work on end-to-end services, spanning customer sites and platforms
- Collaborate and proactively work alongside a talented team of professionals
- Take ownership of responsibilities and constantly seek innovative solutions
- Implement cutting-edge tools that enhance operations, improve reliability, and gather valuable feedback on platforms
- Identify and mitigate common operational issues to deliver seamless experiences to customers
Requirements:
- 10+ years of experience in operational management, including incident management and escalations
- Experience designing and implementing application monitoring to ensure that reliability and performance meet or exceed business goals
- Experience implementing strategies to cap operations load and handle overflow using appropriate tooling and metrics, and defining service level indicators (SLIs) and service level objectives (SLOs) in collaboration with stakeholders and with business, development, DevSecOps, and Operations teams
- Solution and design experience in an enterprise environment: Windows Server, Linux server (RHEL preferred), UNIX (AIX, Solaris), storage, and hyperscaler or public cloud platforms such as AWS, Azure, Google Cloud Platform (GCP), or OpenShift
- Experience working with data formats and scripting languages such as JSON, YAML, Bash, and/or PowerShell
- BS degree in Computer Science, Engineering, or another highly technical or scientific discipline
- Expertise with Ansible, Terraform, and Python
- Experience with distributed technologies as well as dynamic resource management frameworks such as Kubernetes
- Expertise in leveraging open-source tooling such as Prometheus, Grafana, or Loki