Role Overview
- Partner with Engineering teams to design resilient services, architectures, and deployment patterns.
- Define and promote SRE practices including SLIs, SLOs, error budgets, capacity planning, incident response, and post-incident learning.
- Identify systemic reliability risks and work with teams to address root causes.
- Help reduce operational toil through automation, tooling, and better engineering practices.
- Work actively with Engineering teams during design, development, and production-readiness reviews.
- Advise and challenge teams on service architecture, fault tolerance, scalability, observability, deployment safety, and operational readiness, helping them to make pragmatic trade-offs.
- Support teams in diagnosing complex performance, latency, throughput, and resource-utilisation issues.
- Help establish engineering standards and reusable patterns for reliable, maintainable services.
- Lead investigations into performance bottlenecks across applications, infrastructure, databases, queues, networks, and third-party dependencies.
- Improve observability through metrics, logs, traces, dashboards, alerting, and service-level indicators.
- Help teams design meaningful alerts that identify user-impacting issues while reducing noise.
- Drive capacity planning and load-testing practices for critical systems.
- Build and improve automation, deployment tooling, infrastructure-as-code, monitoring, and reliability platforms.
- Contribute to CI/CD improvements, release safety, rollback strategies, and progressive delivery practices.
- Develop tools that help Engineering teams self-serve reliability, diagnostics, and operational insights.
- Improve cloud, container, and orchestration environments with a focus on security, reliability, and scalability.
- Participate in incident response for high-priority production issues.
- Lead or contribute to blameless post-incident reviews.
- Ensure actions from incidents result in improvements to architecture, tooling, monitoring, or process.
- Mentor engineers on production ownership and operational best practices.
Requirements
- Experience in Site Reliability Engineering or senior backend/software engineering roles.
- Software engineering background, with the ability to write clean, maintainable production code.
- Experience working with Engineering teams to influence architecture and improve production readiness.
- Understanding of distributed systems, scalability, resiliency patterns, failure modes, and performance engineering.
- Experience diagnosing complex production issues across application and infrastructure layers.
- Hands-on experience with cloud platforms such as AWS, Azure, or GCP.
- Hands-on experience with on-premises environments and virtualisation.
- Experience with containers and orchestration technologies; Kubernetes is a must.
- Knowledge of observability tooling, including metrics, logging, tracing, dashboards, and alerting.
- Experience with infrastructure-as-code tools such as Terraform.
- Experience with CI/CD pipelines and safe deployment practices.
- Strong scripting or programming skills in languages such as Python, Go, Java, C#, JavaScript/TypeScript, or similar.
- Clear, structured communication skills, with the ability to explain complex technical issues to engineering and leadership audiences.
Tech Stack
- AWS
- Azure
- Cloud
- Distributed Systems
- Google Cloud Platform
- Java
- JavaScript
- Kubernetes
- Python
- Terraform
- TypeScript
- Go
Benefits
We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as “Protected Classes”). We do not tolerate employment discrimination in the workplace, and we are committed to making reasonable accommodations for identified disabilities or other limitations as required by all applicable laws.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.