Okta is a company focused on securing identities in the era of AI. It is seeking a Senior Site Reliability Engineer to lead the evolution of large-scale production systems, ensuring reliability and performance while supporting critical national security missions.
Responsibilities:
- Design, build, and oversee Okta’s production infrastructure, ensuring architectural integrity and peak performance
- Act as a senior escalation point for production incidents, conducting deep-dive root cause analysis and implementing permanent, automated preventive solutions
- Eliminate manual toil by developing sophisticated automation frameworks, evolving monitoring tools, and establishing rigorous technical documentation
- Optimize a highly available, large-scale environment, ensuring 'Always On' service delivery across complex network topologies
- Provide technical guidance to the engineering organization, championing SRE best practices and a culture of self-education
Requirements:
- Must be able to obtain and maintain a U.S. security clearance (Secret or Top Secret) to the extent required by U.S. Government contracts
- Active TS/SCI with Polygraph
- Deep professional experience with FedRAMP and DoD IL6 frameworks
- B.S. in Computer Science or equivalent technical experience
- Mastery of AWS networking and security, including Transit Gateways, VPCs, Route Tables, ELBs, and NACLs
- Advanced experience automating enterprise-scale infrastructure via Terraform or CloudFormation
- Expert-level Linux systems administration with proficiency in Go, Python, Bash, or Ruby
- Proven success managing Docker containers and Java-based stacks (Apache/Tomcat) in high-security production environments
- Solid understanding of networking concepts, IP protocols, and multi-cloud infrastructure