About this role

Okta focuses on securing identities in the age of AI. The company is seeking a Senior Site Reliability Engineer to manage large-scale cloud production systems, ensuring infrastructure reliability and performance while supporting national security missions.

Responsibilities:

Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability
Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions
Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows
Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery

Requirements:

Active TS/SCI clearance
Deep familiarity with FedRAMP and DoD IL6 compliance standards
B.S. in Computer Science or equivalent professional experience
5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts
Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby
Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation)
Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments
Solid understanding of networking concepts and IP protocols
Experience with multi-cloud environments

Staff Site Reliability Engineer, Kubernetes w/ active TS/SCI
