Ampstek is seeking a Lead Observability Platform Engineer to design, build, and operate large-scale observability services. The role involves leading the adoption of OpenTelemetry-based standards across the enterprise while collaborating with SRE, Platform, Cloud, and application teams to improve developer experience and ensure reliable observability at scale.

Responsibilities:

Design, build, and operate core observability platform services using Go, Java (Spring Boot), and Node.js
Lead enterprise-wide adoption of OpenTelemetry, including client libraries, semantic conventions, instrumentation patterns, and Collector/agent strategy
Architect and scale high-throughput, fault-tolerant telemetry pipelines (logs, metrics, traces) with a focus on performance, reliability, and cost efficiency
Develop self-service observability capabilities that simplify onboarding, troubleshooting, and adoption for application teams
Implement end-to-end monitoring of the observability platform itself, defining SLOs, health checks, and alerting
Collaborate with SRE, Platform, and Cloud teams to establish reliability standards, error budgets, and incident response practices
Participate in on-call rotations and lead incident mitigation, root cause analysis, and post-incident reviews
Automate operational workflows and eliminate manual toil through tooling, CI/CD enhancements, and platform automation
Secure telemetry pipelines through mTLS, secrets management, and zero-trust design patterns
Produce and maintain high-quality technical documentation, standards, and best practices
Engage with internal engineering teams to gather requirements, influence roadmap prioritization, and deliver platform improvements
Provide technical leadership through mentorship, design reviews, architectural guidance, and cross-team collaboration with principal engineers and engineering leadership

Requirements:

7+ years of experience in Software Engineering, Platform Engineering, or SRE
5+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management
5+ years building production-grade backend services in Go and/or Java
5+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns
5+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD)
5+ years working with public cloud platforms (AWS, GCP, or Azure)
3+ years designing and scaling distributed, high-volume data pipelines
3+ years working with the Grafana OSS stack (e.g., Grafana, Loki, Tempo, Mimir) or comparable observability backends
3+ years with relational databases (PostgreSQL, MySQL)
Experience with service meshes and networking technologies such as Envoy and Istio
Experience integrating or operating commercial observability platforms (Datadog, New Relic, AppDynamics, etc.)
Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
Experience with Infrastructure as Code tools such as Terraform or CloudFormation
Experience with cost optimization and capacity planning for large-scale telemetry systems
Experience with chaos engineering, resiliency testing, or fault injection
Background in security-aware platform design, including secure service-to-service communication
Experience mentoring senior engineers and influencing platform standards across organizations
Strong operational experience supporting 24x7 production systems, including on-call responsibilities
Strong technical communication and cross-team collaboration skills
Experience operating in regulated or compliance-heavy environments (e.g., healthcare, finance)
