Netskope is a market-leading cloud security company focused on redefining cloud, network, and data security. The Sr. Staff DevOps Engineer will play a critical role in designing, provisioning, and managing scalable cloud infrastructure for the Agentic AI platform, ensuring reliable deployments and maintaining highly available Kubernetes clusters.

Responsibilities:

Work closely with the engineering team and AI/ML engineers to design and architect scalable, secure cloud environments for Agentic Applications using Infrastructure as Code (Terraform)
Design, implement, and manage CI/CD pipelines to ensure safe, repeatable, and reliable deployments across environments
Manage and improve release processes including versioning, rollback strategies, blue/green and canary deployments
Provision and manage Kubernetes clusters across multiple environments, ensuring high availability and scalability
Implement auto-scaling strategies for infrastructure and workloads to optimize performance and cost
Set up and manage monitoring, logging, and alerting systems for infrastructure and application workloads
Operate and oversee large Kubernetes clusters supporting production workloads
Improve reliability, quality, and time-to-market of our software delivery lifecycle
Measure and optimize system performance, proactively identifying bottlenecks and implementing improvements
Provide primary operational support and engineering for multiple large-scale distributed systems and cloud environments
Operate and oversee large Kubernetes clusters with GPU workloads

Requirements:

10+ years of professional experience building and operating core infrastructure systems
Strong hands-on experience with Infrastructure as Code tools such as Terraform
Deep experience with Kubernetes and container orchestration at scale
Experience with major cloud providers (AWS, Google Cloud, or Azure)
Experience designing and managing CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar)
Strong scripting skills using languages like Python or Bash, and experience with Git and GitHub workflows
Experience implementing monitoring and observability solutions using tools such as Prometheus, Grafana, or similar
Proven track record of building and operating scalable, reliable, and secure production systems
Strong troubleshooting skills across distributed systems and cloud-native architectures
Proactive attitude in identifying reliability risks, performance bottlenecks, and automation opportunities
Comfortable working with ambiguity and rapid change in a dynamic environment
Familiarity with LLM development, deployment, and optimization techniques
Familiarity with high-performance, large-scale ML systems and their unique infrastructure needs
BSCS or equivalent required; MSCS or equivalent strongly preferred

Sr. Staff DevOps Engineer, Agentic AI