We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Infrastructure as Code tools such as Terraform, core CI/CD tools such as Azure DevOps, and monitoring tools including Datadog and AWS CloudWatch.
Strong leadership in client-facing discussions and engagement with third-party suppliers is essential.
Responsibilities
Troubleshooting issues and identifying systemic failings indicated by incidents/failures.
Implementing fixes.
Proposing solutions for reducing toil.
Providing leadership in the incident resolution process, including creating and maintaining documentation and providing key input to post-mortem analysis.
Improving Service Request and Change Management processes, both technically and through stakeholder management.
Participating in the security management process and proactively mitigating risks (vulnerabilities in code, infrastructure, and dependencies).
Requirements
3-9 years' experience
Bachelor’s degree (or equivalent) in computer science or a related discipline
SRE Foundation certificate (DevOps Institute) and an 'associate'-level cloud provider certification (AWS, Azure, or GCP), either already held or completed during the probationary period.
Proficiency in Azure and Kubernetes, with hands-on experience in managing and deploying applications.
Expertise in Infrastructure as Code (IaC) using Terraform for efficient and scalable infrastructure management.