Netskope is a market-leading cloud security company focused on redefining cloud, network, and data security. The Sr. Staff DevOps Engineer will be critical in designing, provisioning, and managing scalable cloud infrastructure for the Agentic AI platform, ensuring reliable deployments and maintaining highly available Kubernetes clusters.
Responsibilities:
- Work closely with the engineering team and AI/ML engineers to design and architect scalable, secure cloud environments for Agentic Applications using Infrastructure as Code (Terraform)
- Design, implement, and manage CI/CD pipelines to ensure safe, repeatable, and reliable deployments across environments
- Manage and improve release processes, including versioning, rollback strategies, and blue/green and canary deployments
- Provision and manage Kubernetes clusters across multiple environments, ensuring high availability and scalability
- Implement auto-scaling strategies for infrastructure and workloads to optimize performance and cost
- Set up and manage monitoring, logging, and alerting systems for infrastructure and application workloads
- Operate and oversee large Kubernetes clusters supporting production workloads
- Improve reliability, quality, and time-to-market of our software delivery lifecycle
- Measure and optimize system performance, proactively identifying bottlenecks and implementing improvements
- Provide primary operational support and engineering for multiple large-scale distributed systems and cloud environments
- Operate and oversee large Kubernetes clusters with GPU workloads
Requirements:
- 10+ years of professional experience building and operating core infrastructure systems
- Strong hands-on experience with Infrastructure as Code tools such as Terraform
- Deep experience with Kubernetes and container orchestration at scale
- Experience with major cloud providers (AWS, Google Cloud, or Azure)
- Experience designing and managing CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar)
- Strong scripting skills using languages like Python or Bash, and experience with Git and GitHub workflows
- Experience implementing monitoring and observability solutions using tools such as Prometheus, Grafana, or similar
- Proven track record of building and operating scalable, reliable, and secure production systems
- Strong troubleshooting skills across distributed systems and cloud-native architectures
- Proactive attitude in identifying reliability risks, performance bottlenecks, and automation opportunities
- Comfortable working with ambiguity and rapid change in a dynamic environment
- Familiarity with LLM development, deployment, and optimization techniques
- Familiarity with high-performance, large-scale ML systems and their unique infrastructure needs
- BSCS or equivalent required; MSCS or equivalent strongly preferred