Career Renew is recruiting for one of its clients, a fast-growing software company supporting and developing Hedera, an open-source, proof-of-stake public ledger. They are seeking a Senior Site Reliability Engineer to design, deploy, and ensure the reliability of multi-region infrastructure for large organizations across various sectors.
Responsibilities:
- Design, build, and operate highly available, multi-region distributed systems with clear recovery strategies and tested RTO/RPO targets
- Partner with the Head of SRE to define the reliability roadmap, platform architecture, and operational standards
- Own large-scale Infrastructure as Code using Terraform, including reusable modules, multi-account patterns, and policy guardrails
- Operate and scale Kubernetes environments (EKS, GKE, or AKS) using GitOps practices (ArgoCD), Helm, and strong RBAC and network policies
- Build and maintain secure CI/CD pipelines, including blue/green and canary deployments, promotion and rollback strategies, and artifact integrity (SBOM, signing)
- Define and improve SRE practices, including SLOs, error budgets, observability, and measurable reductions in MTTR/MTTA
- Work closely with product and engineering teams to translate customer and business requirements into reliable, secure platform services
- Contribute to the operational support and continuous improvement of customer-facing HashSphere deployments
Requirements:
- Proven experience designing and building production-grade systems on Azure
- Ability to turn ambiguous requirements into structured technical solutions and carry them through to delivered systems
- Strong technical communication skills with both engineering and non-technical stakeholders
- High ownership mindset with a bias for action and accountability
- Collaborative approach with a focus on building durable, scalable solutions
- Azure cloud services (networking, compute, identity, security, storage)
- Terraform (infrastructure as code at production scale)
- Programming experience in Go and/or Python
- Experience building greenfield infrastructure environments
- Distributed systems, high-availability architectures, or platform engineering
- CI/CD and automation tooling for infrastructure lifecycle management
- Kubernetes and container orchestration
- Observability tooling (Prometheus, Grafana)
- Workflow/orchestration platforms (Argo, Spacelift, or similar)