Zayo Group, a provider of mission-critical bandwidth to impactful companies, is seeking a Senior Site Reliability Engineer (SRE) to ensure the uptime, performance, and scalability of its critical infrastructure. The role involves incident management, reliability engineering, and collaboration with various teams to develop robust technical solutions.

Responsibilities:

Incident Management: Own the incident lifecycle, from leading root cause analysis and resolution to implementing measures that prevent recurrence. Be on call to diagnose and resolve critical service outages
Reliability Engineering: Proactively identify and mitigate potential system risks, focusing on automation, monitoring, and tooling to ensure high service availability
Scalability and Performance: Design and implement solutions to ensure our infrastructure can handle ever-growing demands while maintaining optimal application performance
Collaboration: Work closely with developers, product managers, and other engineers to translate business needs into robust and reliable technical solutions. Become the beacon for best practices and efficient processes throughout the organization
Continuous Learning: Stay up to date with the latest trends and technologies in SRE practices, automation tools, and cloud platforms

Requirements:

Bachelor's degree in computer science, engineering, or a related field (or equivalent experience)
Minimum of 7 (seven) years of experience in a Site Reliability Engineering or related role
Strong understanding of system administration, Linux, and scripting languages (Python and various shells)
Experience working in large-scale distributed production environments
Experience with cloud platforms, especially troubleshooting and debugging in AWS using AWS-native tools
Experience with container orchestration (Kubernetes and Docker)
Experience deploying and managing scalable infrastructure within AWS and Kubernetes ecosystems using Terraform and other cloud-native approaches
Experience with infrastructure management tools such as Ansible, Terraform, and Puppet
Experience with monitoring and alerting tools (Prometheus, Grafana, Cacti, etc.)
Experience with monitoring platforms such as SevOne, Assure1, and Nagios
Proven ability to work independently and as part of a team
Excellent problem-solving, analytical, and critical thinking skills
A passion for automation and building efficient systems
Experience working with various vendor APIs (or NETCONF), including those from Nokia, Juniper, Fujitsu, Infinera, Cisco, and Ciena
Strong working knowledge of networking concepts and application protocols, especially TCP/IP, BGP, and DNS, with focused experience in Internet Service Provider services such as IP VPN, Transit, and Waves
Experience with various network orchestration platforms such as Ciena Blue Planet MDSO, Cisco NSO, Nokia NSP, or others
Previous experience with Golang
