Role Overview

You’ll design and maintain infrastructure that is highly available, fault-tolerant, and scalable
You’ll proactively identify and eliminate single points of failure before they become incidents
You’ll ensure our production systems remain stable, even under increasing scale and load
You’ll manage and continuously improve workloads across AWS, GCP, or Azure
You’ll use Infrastructure as Code (Terraform) to standardize and scale infrastructure
You’ll optimize resource usage to balance performance and cost
You’ll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence
You’ll troubleshoot issues quickly and ensure smooth deployments and upgrades
You’ll ensure our containerized workloads perform reliably at scale
You’ll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK
You’ll define alerting that is meaningful, not noisy
You’ll respond to incidents, lead root cause analysis, and ensure we learn from every failure
You’ll write scripts and build tooling to eliminate repetitive operational work
You’ll continuously improve infrastructure efficiency through automation
You’ll promote a culture where manual work is a temporary state, not the norm
You’ll work closely with DevOps and engineering teams to solve performance bottlenecks
You’ll contribute to CI/CD improvements and deployment reliability
You’ll help shape reliability best practices across the organization

Requirements

You’ve spent ~3 years working in SRE, DevOps, or infrastructure engineering, and you’ve seen what breaks at scale
You’re comfortable working in cloud environments like AWS, GCP, or Azure, and you understand how distributed systems behave
You’ve worked hands-on with Kubernetes in production and know how to troubleshoot it when things go wrong
You don’t just fix issues; you ask why they happened and make sure they don’t happen again
You use Terraform (or similar IaC tools) to manage infrastructure
You work confidently with Docker and Kubernetes
You write scripts in Python, Bash, or similar to automate workflows
You understand CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket, etc.)
You have a solid grasp of networking, load balancing, and high-availability design
You’ve implemented tools like Prometheus, Grafana, Datadog, or ELK
You know the difference between useful alerts and noise
You focus on signals that actually drive action
You take ownership; you don’t wait to be told something is broken
You’re calm under pressure and methodical during incidents
You simplify complexity instead of adding to it
You communicate clearly, even when explaining deeply technical issues
You care about building systems that make other engineers more effective

Nice to Have (but not required)

Experience with RabbitMQ or Redis in production
Familiarity with Ansible or AWX
Exposure to multi-cloud or hybrid environments
Cloud certifications (AWS, GCP) or Linux certifications
Background from ITI (Information Technology Institute)

Tech Stack

Ansible
AWS
Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Jenkins
Kubernetes
Linux
Prometheus
Python
RabbitMQ
Redis
Terraform

Site Reliability Engineer
