About this role

TerraGiG is seeking a Site Reliability Engineer with extensive experience in enterprise observability and SRE architecture. The role involves designing the enterprise observability strategy, conducting tool evaluations, implementing reference architectures, and building operational maturity frameworks.

Responsibilities:

Designing our enterprise observability strategy
Conducting tool evaluations
Implementing reference architectures
Building operational maturity frameworks

Requirements:

8-12 years of experience in enterprise observability, SRE, or APM architecture
Proven track record building observability platforms from scratch (not just tool implementation)
Hands-on experience with Grafana and familiarity with Dynatrace, Datadog, Prometheus, or similar APM/monitoring platforms
Strategic architecture experience: reference architectures, maturity models, stakeholder requirements gathering
Experience with observability frameworks (MELT: correlation of metrics, events, logs, and traces)
Strong understanding of MTTD/MTTR measurement and improvement
Excellent English communication skills for US stakeholder collaboration
Azure cloud platform experience (our primary environment)
Financial services or professional services industry background
Experience working with US-based teams
Availability for overlap with US Eastern time zone
OpenTelemetry, synthetic monitoring, and AIOps experience
