Senior Observability & Telemetry Engineer at Submer | JobVerse
Submer
Remote
United Kingdom
Full Time
5 hours ago
Key skills
Cloud
Distributed Systems
Kubernetes
Python
Rust
Go
AI
CI/CD
About this role
Role Overview
Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure.
Architect observability systems capable of ingesting high-cardinality telemetry from thousands of nodes and services.
Build and operate telemetry storage systems optimized for large-scale time-series and event data.
Contribute to observability standards across services, including metrics, tracing instrumentation, logging, and SLO implementation.
Instrument GPU clusters, inference workloads, and distributed training environments.
Implement telemetry pipelines for GPU, CPU, network, and storage performance metrics.
Build dashboards and monitoring tools that expose system health and performance to both internal teams and customers.
Develop performance analysis tools that help customers understand system bottlenecks.
Develop and maintain network observability platforms.
Requirements
Proven experience operating large distributed infrastructure platforms.
Strong background in observability systems and telemetry pipelines.
Experience building metrics, logging, tracing, alerting, and dashboards at production scale.
Strong programming skills in Go, Python, or Rust.
Experience with large-scale time-series data platforms.
Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure.
Experience monitoring AI workloads such as training or inference clusters.
Deep understanding of distributed systems observability.
Familiarity with cloud-native infrastructure and practices such as Kubernetes, automation, and CI/CD.
Experience operating observability systems for high-performance or large-scale environments.
Experience monitoring complex networking environments.
Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry.
Strong data analysis capabilities.
Ability to interpret complex telemetry signals and translate them into actionable insights.
Ability to diagnose performance issues across distributed systems.
Tech Stack
Cloud
Distributed Systems
Kubernetes
Python
Rust
Go
Benefits
Attractive compensation package reflecting your expertise and experience.
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
Opportunity to be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.
Inclusive responsibility: a commitment to creating a diverse and inclusive environment.