TerraGiG is seeking a Site Reliability Engineer with extensive experience in enterprise observability and SRE architecture. The role involves designing the enterprise observability strategy, conducting tool evaluations, implementing reference architectures, and building operational maturity frameworks.
Responsibilities:
- Designing our enterprise observability strategy
- Conducting tool evaluations
- Implementing reference architectures
- Building operational maturity frameworks
Requirements:
- 8-12 years of experience in enterprise observability, SRE, or APM architecture
- Proven track record building observability platforms from scratch (not just tool implementation)
- Hands-on experience with Grafana and familiarity with Dynatrace, Datadog, Prometheus, or similar APM/monitoring platforms
- Strategic architecture experience: reference architectures, maturity models, stakeholder requirements gathering
- Experience with observability frameworks such as MELT (correlation of Metrics, Events, Logs, and Traces)
- Strong understanding of MTTD/MTTR measurement and improvement
- Excellent English communication skills for US stakeholder collaboration
- Experience with the Azure cloud platform (our primary environment)
- Financial services or professional services industry background
- Experience working with US-based teams
- Availability for overlap with US Eastern time zone
- Experience with OpenTelemetry, synthetic monitoring, and AIOps