Okta is a company focused on securing identities in the age of AI. It is seeking a Senior Site Reliability Engineer to manage large-scale cloud production systems and ensure infrastructure reliability and performance while supporting national security missions.
Responsibilities:
- Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability
- Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions
- Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows
- Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery
Requirements:
- Active TS/SCI clearance
- Deep familiarity with FedRAMP and DoD IL6 compliance standards
- B.S. in Computer Science or equivalent professional experience
- 5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts
- Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby
- Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation)
- Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments
- Solid understanding of networking concepts and IP protocols
- Experience with multi-cloud environments