Hands-on Instrumentation: Change application code to implement distributed tracing, custom metrics and structured logs using the OpenTelemetry SDKs.
Time Series Data Architecture: Design and maintain high-performance metrics pipelines on modern time-series databases capable of handling high volume and high cardinality.
Visualization Ecosystem: Create analytical dashboards in Grafana and advanced monitors in Datadog, focusing on the Golden Signals (Latency, Errors, Traffic and Saturation).
Multi-Cloud Operation: Configure metric and trace collection across AWS, GCP and Azure environments, ensuring a unified view of the infrastructure.
Business Monitoring: Create time series that reflect the health of the SaaS platform and real-time betting behavior (e.g., bets/sec vs. API latency).
Error Culture: Define and implement technical SLIs/SLOs, ensuring the engineering team has actionable alerts and avoids alert fatigue.
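As an illustration of the SLI/SLO work described above, here is a minimal sketch of error-budget burn-rate arithmetic. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from the commonly cited multi-window alerting pattern for a 30-day SLO, not a value used by this team.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means the budget is being spent exactly at the rate
    the SLO allows; higher values mean it will run out early.
    """
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_ratio = bad_events / total_events
    return error_ratio / error_budget(slo_target)


def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold (assumed value).

    Requiring both a short and a long window keeps alerts actionable:
    a brief spike alone does not page, which helps limit alert fatigue.
    """
    return short_window_burn >= threshold and long_window_burn >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: 20 failed requests out of 1,000 against a 99.9% SLO.
    rate = burn_rate(bad_events=20, total_events=1000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print("page:", should_page(short_window_burn=rate, long_window_burn=rate))
```

Alerting on burn rate rather than on raw error counts ties each page to the SLO itself, which is one way to keep alerts actionable.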
Requirements
Technical Seniority: Solid experience as an SRE or Software Engineer focused on infrastructure and performance.
Practical experience building observability solutions for distributed architectures, including distributed tracing, service instrumentation, high-cardinality metrics and microservices monitoring.
Experience with OpenTelemetry is desirable, but the most important requirement is mastery of telemetry and large-scale system visibility concepts.
Cloud Proficiency: Hands-on experience administering and monitoring resources across multiple clouds (especially AWS and GCP).
Datadog & Grafana Power User: Ability to write complex queries (PromQL/LogQL) and configure advanced APM.
Development: Ability to read and modify code in multiple languages to tune performance and adjust telemetry.
Desirable: experience with storage and analysis of observability data at scale (e.g., VictoriaMetrics, ClickHouse, Prometheus/Mimir, or other time-series and analytics solutions).