Server and container health (CPU, memory, disk, network, capacity trends)
Database health and performance (availability, replication, query latency, resource utilization)
Application and infrastructure logging, including centralized log ingestion, indexing, and search
Build actionable alerts with clear runbooks, ownership, and escalation paths to minimize mean time to detect (MTTD) and mean time to resolve (MTTR).
Partner with application, platform, and DevOps teams to instrument services with metrics, traces, and structured logs.
Continuously improve signal quality by reducing alert noise, eliminating false positives, and optimizing thresholds based on historical trends.
Create and maintain dashboards for real-time operational visibility and executive-level health reporting.
Support incident response and post-incident reviews by providing high-fidelity telemetry and contributing to root cause analysis.
Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations
Hands-on expertise with Prometheus, Grafana, Loki, and Tempo in large-scale production environments
Strong understanding of monitoring distributed systems spanning both on-premises and cloud environments (GCP, Azure)
Experience defining SLOs/SLIs and building alerting strategies based on reliability engineering best practices
Exceptional attention to detail, with the ability to reason through complex systems end to end, anticipate edge cases, failure modes, and cascading impacts, and proactively design monitoring and alerting that covers both common and rare operational scenarios