Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. The company is seeking a skilled Observability Engineer to design and operate metrics, logging, tracing, and alerting platforms, so that engineering teams have confidence in the systems they run.

Responsibilities:

Design and operate enterprise-grade observability platforms covering metrics, logs, traces, events, and synthetic monitoring
Architect Prometheus / Thanos / Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog deployments for high availability and scale
Develop standards for service instrumentation, including OpenTelemetry adoption, metric naming, label cardinality, and structured logging conventions
Define and enforce SLOs, SLIs, and error budgets, and build the dashboards and alerts that operationalize them
Build alerting strategies that minimize noise, surface actionable signals, and integrate cleanly with on-call workflows in PagerDuty, Opsgenie, or similar tools
Operate large-scale time-series and log storage platforms, balancing retention, query performance, and cost
Design distributed tracing pipelines and help teams use traces to diagnose latency and reliability issues
Develop self-service tooling, paved-road libraries, and templates that make adoption of observability standards easy for product teams
Drive cost management and label-cardinality discipline across the observability estate
Lead incident response readiness improvements through better dashboards, alerting hygiene, and post-incident analysis tooling
Partner with SRE and platform teams to integrate observability into deployment pipelines, canary analysis, and progressive delivery workflows
Evaluate and recommend observability vendors and open-source tools based on cost, capability, and operational maturity
Mentor engineering teams on observability fundamentals, debugging techniques, and SLO-driven operations
Maintain documentation, onboarding guides, and runbooks for the observability platform

Requirements:

Bachelor's degree in Computer Science or a related field
Five or more years of experience in SRE, platform engineering, or observability roles
Deep hands-on experience with Prometheus, Grafana, and at least one major commercial observability platform such as Datadog, New Relic, or Splunk
Strong understanding of OpenTelemetry, distributed tracing, and structured logging
Proficiency in at least one general-purpose language such as Go, Python, or Java
Experience operating high-cardinality, high-throughput metrics and log pipelines
Strong understanding of SLOs, error budgets, and SRE principles
Experience integrating observability with CI/CD and incident management tooling
Solid grasp of Linux internals, networking, and container platforms
Excellent communication and collaboration skills
Experience with Thanos, Mimir, Cortex, Loki, or Tempo at scale
Contributions to OpenTelemetry or observability open-source projects
Familiarity with eBPF-based observability tooling
Experience driving observability cost optimization initiatives
Exposure to regulated environments with audit-grade logging requirements

Observability Engineer (Prometheus / Grafana / Datadog)