SentinelOne is redefining cybersecurity by leveraging AI-powered, data-driven innovation. They are seeking an experienced engineering and operations manager to lead a Site Reliability Engineering (SRE) team. The role focuses on ensuring the reliability and scalability of SentinelOne's products and production services while collaborating with engineering teams and customer-facing departments.
Responsibilities:
- Grow and lead a team of SRE professionals, including setting performance goals, measuring deliverables against key metrics, and evolving those metrics as SentinelOne (S1) grows and its needs develop
- Invest in data-driven, in-depth triage of recurring issues, collaborating with other engineering teams to identify and address issues related to reliability, performance, and capacity
- Develop, improve, and implement processes for the full incident lifecycle, including incident management, post-incident analysis, and learning from incidents. Lead incident response efforts, including coordinating with other teams to investigate and resolve customer-impacting incidents
- Design a support model for SRE regarding service maturity and service ownership, including monitoring and alerting improvements, and SLI/SLO design and implementation
- Analyze production metrics and signals to identify areas for improvement and take proactive steps to mitigate issues
- Develop and implement best practices and standards for Site Reliability Engineering, from day-to-day operations to hiring and planning
- Communicate effectively with cross-functional teams to ensure alignment on objectives and priorities. Deliver outcomes, not just stories and tasks
Requirements:
- 8+ years of related engineering experience, with at least 2 years in a management role
- Demonstrated experience leading technical and operational teams at various stages of maturity
- Excellent analytical and problem-solving skills
- Familiarity with modern software development methodologies, tools, and techniques, including CI/CD
- Experience working with cloud-native applications and large-scale distributed systems, including a working knowledge of technologies such as Kubernetes and Terraform/IaC, and cloud providers such as AWS or GCP
- Experience with monitoring and alerting techniques and tools, including frameworks and concepts such as SLOs, OpenTelemetry (OTel), and the Golden Signals, as well as tooling such as Prometheus and Grafana
- Extensive experience with incident response and management at various layers of the stack across different business needs and applications, including both hands-on experience leading incidents/post-incident analysis and experience driving broader incident management initiatives
- Ability to thrive in a fast-paced, dynamic environment