Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. The company is seeking a skilled Observability Engineer to design and operate metrics, logging, tracing, and alerting platforms that give engineering teams confidence in the systems they run.
Responsibilities:
- Design and operate enterprise-grade observability platforms covering metrics, logs, traces, events, and synthetic monitoring
- Architect Prometheus / Thanos / Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog deployments for high availability and scale
- Develop standards for service instrumentation, including OpenTelemetry adoption, metric naming, label cardinality, and structured logging conventions
- Define and enforce SLOs, SLIs, and error budgets, and build the dashboards and alerts that operationalize them
- Build alerting strategies that minimize noise, surface actionable signals, and integrate cleanly with on-call workflows in PagerDuty, Opsgenie, or similar tools
- Operate large-scale time-series and log storage platforms, balancing retention, query performance, and cost
- Design distributed tracing pipelines and help teams use traces to diagnose latency and reliability issues
- Develop self-service tooling, paved-road libraries, and templates that make adoption of observability standards easy for product teams
- Drive cost management and label-cardinality discipline across the observability estate
- Lead incident response readiness improvements through better dashboards, alerting hygiene, and post-incident analysis tooling
- Partner with SRE and platform teams to integrate observability into deployment pipelines, canary analysis, and progressive delivery workflows
- Evaluate and recommend observability vendors and open-source tools based on cost, capability, and operational maturity
- Mentor engineering teams on observability fundamentals, debugging techniques, and SLO-driven operations
- Maintain documentation, onboarding guides, and runbooks for the observability platform
Requirements:
- Bachelor's degree in Computer Science or a related field
- Five or more years of experience in SRE, platform engineering, or observability roles
- Deep hands-on experience with Prometheus, Grafana, and at least one major commercial observability platform such as Datadog, New Relic, or Splunk
- Strong understanding of OpenTelemetry, distributed tracing, and structured logging
- Proficiency in at least one general-purpose language such as Go, Python, or Java
- Experience operating high-cardinality, high-throughput metrics and log pipelines
- Strong understanding of SLOs, error budgets, and SRE principles
- Experience integrating observability with CI/CD and incident management tooling
- Solid grasp of Linux internals, networking, and container platforms
- Excellent communication and collaboration skills
Preferred Qualifications:
- Experience with Thanos, Mimir, Cortex, Loki, or Tempo at scale
- Contributions to OpenTelemetry or observability open-source projects
- Familiarity with eBPF-based observability tooling
- Experience driving observability cost optimization initiatives
- Exposure to regulated environments with audit-grade logging requirements