Zayo Group provides mission-critical bandwidth to impactful companies and is seeking a Senior Site Reliability Engineer (SRE) to ensure the uptime, performance, and scalability of its critical infrastructure. The role involves incident management, reliability engineering, and collaboration with various teams to develop robust technical solutions.
Responsibilities:
- Incident Management: Own the incident lifecycle, from leading root cause analysis and resolution to implementing measures that prevent recurrence. Be on call to diagnose and resolve critical service outages
- Reliability Engineering: Proactively identify and mitigate potential system risks, focusing on automation, monitoring, and tooling to ensure high service availability
- Scalability and Performance: Design and implement solutions to ensure our infrastructure can handle ever-growing demands while maintaining optimal application performance
- Collaboration: Work closely with developers, product managers, and other engineers to translate business needs into robust and reliable technical solutions. Become the beacon for best practices and efficient processes throughout the organization
- Continuous Learning: Stay up to date with the latest trends and technologies in SRE practices, automation tools, and cloud platforms
Requirements:
- Bachelor's degree in computer science, engineering, or a related field (or equivalent experience)
- Minimum of 7 (seven) years of experience in a Site Reliability Engineering or related role
- Strong understanding of system administration, Linux, and scripting languages (Python and various shells)
- Experience working in large-scale distributed production environments
- Experience with cloud platforms, especially troubleshooting and debugging in AWS using AWS-native tools
- Experience with container orchestration (Kubernetes and Docker)
- Experience deploying and managing scalable infrastructure within AWS and Kubernetes ecosystems using Terraform and other cloud-native approaches
- Experience with infrastructure management tools such as Ansible, Terraform, and Puppet
- Experience with monitoring and alerting tools (Prometheus, Grafana, Cacti, etc.)
- Experience with monitoring platforms such as SevOne, Assure1, and Nagios
- Proven ability to work independently and as part of a team
- Excellent problem-solving, analytical, and critical thinking skills
- A passion for automation and building efficient systems
- Experience working with various vendor APIs (or Netconf) including Nokia, Juniper, Fujitsu, Infinera, Cisco, and Ciena
- Strong working knowledge of networking concepts and application protocols, especially TCP/IP, BGP, and DNS, with focused experience in Internet Service Provider services such as IP VPN, Transit, and Waves
- Experience with various network orchestration platforms such as Ciena Blue Planet MDSO, Cisco NSO, Nokia NSP, or others
- Previous experience with Golang