Site Reliability Architect at qode.world | JobVerse
qode.world
Site Reliability Architect
Texas, United States of America
Full Time
2 hours ago
No Visa Sponsorship
Key skills
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Grafana
Kafka
Microservices
Prometheus
Terraform
AI
ML
GCP
Google Cloud
OpenTelemetry
Dynatrace
Leadership
About this role
Role Overview
Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
Build actionable dashboards for operations, engineering, and leadership
Implement alerting strategies using static and dynamic thresholds
Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
Transition monitoring from reactive alerts to proactive insights
Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code
Deep hands-on experience with Dynatrace (mandatory)
Experience with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
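For candidates less familiar with the SLI, SLO, and error-budget terms used above, the arithmetic behind them can be sketched as follows. This is an illustrative example only, not the employer's tooling: the function name, its parameters, and the request counts are hypothetical, and it assumes a simple availability SLI measured as the ratio of good requests to total requests.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise an availability SLI against an SLO target (hypothetical helper).

    slo_target is a fraction such as 0.999 for a 99.9% SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in the same window.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,  # 1.0 means the budget is fully spent
        "budget_remaining": 1.0 - budget_consumed,
    }


if __name__ == "__main__":
    # Made-up numbers: a 99.9% SLO over one million requests with 250 failures.
    print(error_budget_report(0.999, 1_000_000, 250))
```

In this made-up case the SLO tolerates roughly 1,000 failed requests, so 250 failures consume about a quarter of the budget.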
Requirements
15+ years in SRE / Production Engineering
Strong Unified Observability background (not infra-only)
Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
SLI/SLO engineering experience in production systems
Experience implementing dynamic thresholds and anomaly detection
Knowledge of AI/ML concepts applied to Ops (AIOps)
Distributed systems troubleshooting expertise
Experience with Kafka or streaming data platforms
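The difference between the static and dynamic thresholds mentioned above can be illustrated with a small sketch. It is a simplified stand-in, not how Dynatrace's Davis AI or any other product computes baselines: the function names, the window size, the multiplier k, and the sample latencies are all assumptions chosen for illustration. The dynamic check flags a point that exceeds the rolling mean of the preceding window by more than k standard deviations.

```python
from statistics import mean, stdev


def static_breaches(values, limit):
    """Return indices of samples above a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def dynamic_breaches(values, window=5, k=3.0):
    """Return indices of samples above a rolling baseline.

    The baseline for sample i is the mean of the previous `window` samples,
    and the threshold is mean + k * sample standard deviation. A perfectly
    flat baseline has zero deviation, so any increase is flagged; a real
    system would usually add a minimum tolerance.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    breaches = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        threshold = mean(baseline) + k * stdev(baseline)
        if values[i] > threshold:
            breaches.append(i)
    return breaches


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds with one spike.
    latencies = [100, 102, 98, 101, 99, 100, 180, 101]
    print("static (limit=200):", static_breaches(latencies, 200))
    print("dynamic (window=5, k=3):", dynamic_breaches(latencies))
```

With these sample values the fixed limit of 200 ms misses the spike to 180 ms, while the rolling baseline flags it because it is far outside recent behaviour.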
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Grafana
Kafka
Microservices
Prometheus
Terraform