Act as Incident Commander during events impacting SaaS availability, ensuring a scientific approach to troubleshooting and resolution as well as clear stakeholder communications.
Be a champion for post-mortem culture by leading blameless incident retrospectives, identifying systemic root causes and improvement opportunities.
Facilitate our change enablement practice, partnering with engineering teams to balance delivery velocity against production risk when planning infrastructure and software changes.
Identify and execute on opportunities to reduce operational toil, automate manual processes, or make our processes leaner while empowering teams and reducing cross-team dependencies.
Lead Chaos Engineering activities to promote incident preparedness among engineering teams.
Requirements
Bachelor’s degree in Computer Science, Engineering, or equivalent experience (completed or in final year)
Familiarity with cloud infrastructure and CI/CD pipelines (Azure, networking, Kubernetes, SQL, etc.)
Experience with multiple programming and scripting languages such as Bash, Java, T-SQL, Terraform/Helm, Python, and TypeScript
Knowledge of monitoring and observability tools (AppDynamics, Prometheus, ELK, etc.) and the ability to interpret data and logically hypothesize probable failure modes
Strong analytical skills in root cause analysis, debugging, and technical/system-level problem solving
Experience facilitating or leading technical teams
Ability to articulate complex technical concepts clearly to diverse audiences, both verbally and in written formats
Ability to effectively interface with AI, use context, and perform prompt engineering to support day-to-day work
Familiarity with Agile, DevOps, SRE, ITIL, platform engineering, or Scrum practices
Ability to build and present insightful data visualizations using any BI tool or language