Define and own the enterprise-wide observability architecture, establishing technical standards, reference architectures, and multi-year roadmaps.
Evaluate, select, and standardize observability tooling (e.g., Grafana, Prometheus, VictoriaMetrics, Tempo, Loki, Elastic Stack, OpenTelemetry) to reduce tool sprawl and optimize total cost of ownership.
Design scalable data pipelines and storage strategies capable of ingesting and querying petabyte-scale telemetry across metrics, traces, logs, and continuous profiles.
Design Terraform modules and Helm charts for declarative observability infrastructure provisioning across multi-cloud environments.
Establish and enforce instrumentation standards using the OpenTelemetry framework, including SDK guidelines, collector deployment patterns, and semantic conventions.
Define and champion SLO, SLI, and error-budget frameworks across engineering teams, providing architectural guidance on implementing them.
Serve as a senior escalation point during critical incidents, applying deep observability expertise to speed up diagnosis and resolution.
Provide architectural mentorship and technical guidance to Observability Engineers and SRE team members.
Requirements
5 to 8 years of experience in Observability Architecture, Site Reliability Engineering (SRE), or Platform/Infrastructure Engineering.
Post-secondary diploma or degree in Engineering, Computer Science, or a related field.
Mastery of the OpenTelemetry ecosystem and expert-level knowledge of Prometheus-compatible metrics systems (VictoriaMetrics, Thanos, etc.).
Advanced experience with tracing systems (Grafana Tempo, Jaeger) and log aggregation and analytics platforms (Loki, Elasticsearch, Google BigQuery).
Expert-level proficiency in cloud infrastructure (GCP strongly preferred) and Kubernetes architecture.
Strong software engineering skills in Go, Python, or similar languages for building cloud-native tooling.
Excellent communication skills with the ability to influence technical direction across organizational boundaries.
Preferred certifications: Google Cloud Professional Cloud Architect or Certified Kubernetes Administrator (CKA).