Netskope is a leading cloud security company focused on redefining Cloud, Network, and Data Security. The Staff Site Reliability Engineer will improve the reliability and performance of engineering stacks, develop software to solve operational problems, and ensure the availability of training environments across multiple clusters.

Responsibilities:

Partner closely with service owners and engineers to develop reliable services driven by best practices
Develop software and tools to solve a variety of problems across services and infrastructure
Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs
Ensure training environments are consistently available and prepared across multiple clusters
Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes
Improve reliability, quality, and time-to-market of our suite of software solutions
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
Provide primary operational support and engineering for multiple large-scale distributed software applications

Requirements:

Programming experience in at least one language
Good understanding of distributed systems principles
Deep understanding of Kubernetes and Docker
Understanding of data technologies such as Kafka, Yugabyte, and Redis
Good understanding of the AWS ecosystem
Basic understanding of networking
Exposure to infrastructure-as-code tools such as Terraform
Familiarity with monitoring tools such as Prometheus, Grafana, or similar
8+ years building core infrastructure
BSCS or equivalent required, MSCS or equivalent strongly preferred
Experience operating and monitoring services that communicate across AWS and private clouds
Experience operating Kubernetes at scale
