Ampstek is seeking a Lead Observability Platform Engineer to design, build, and operate large-scale observability services. The role involves leading the adoption of OpenTelemetry-based standards across the enterprise while collaborating with engineering teams to improve developer experience and ensure reliable observability at scale.
Responsibilities:
- Design, build, and operate core observability platform services using Go, Java (Spring Boot), and Node.js
- Lead enterprise-wide adoption of OpenTelemetry, including client libraries, semantic conventions, instrumentation patterns, and Collector/agent strategy
- Architect and scale high-throughput, fault-tolerant telemetry pipelines (logs, metrics, traces) with a focus on performance, reliability, and cost efficiency
- Develop self-service observability capabilities that simplify onboarding, troubleshooting, and adoption for application teams
- Implement end-to-end monitoring of the observability platform itself, defining SLOs, health checks, and alerting
- Collaborate with SRE, Platform, and Cloud teams to establish reliability standards, error budgets, and incident response practices
- Participate in on-call rotations and lead incident mitigation, root cause analysis, and post-incident reviews
- Automate operational workflows and eliminate manual toil through tooling, CI/CD enhancements, and platform automation
- Ensure secure telemetry pipelines through mTLS, secrets management, and zero-trust design patterns
- Produce and maintain high-quality technical documentation, standards, and best practices
- Engage with internal engineering teams to gather requirements, influence roadmap prioritization, and deliver platform improvements
- Provide technical leadership through mentorship, design reviews, architectural guidance, and cross-team collaboration with principal engineers and engineering leadership
Requirements:
- 7+ years of experience in Software Engineering, Platform Engineering, or SRE
- 5+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management
- 5+ years building production-grade backend services in Go and/or Java
- 5+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns
- 5+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD)
- 5+ years working with public cloud platforms (AWS, GCP, or Azure)
- 3+ years designing and scaling distributed, high-volume data pipelines
- 3+ years working with the Grafana OSS stack (Grafana, Loki, Tempo, Mimir) or comparable observability backends
- 3+ years with relational databases (PostgreSQL, MySQL)
- Experience with service meshes and networking technologies such as Envoy and Istio
- Experience integrating or operating commercial observability platforms (Datadog, New Relic, AppDynamics, etc.)
- Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
- Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
- Experience with Infrastructure as Code tools such as Terraform or CloudFormation
- Experience with cost optimization and capacity planning for large-scale telemetry systems
- Experience with chaos engineering, resiliency testing, or fault injection
- Background in security-aware platform design, including secure service-to-service communication
- Experience mentoring senior engineers and influencing platform standards across organizations
- Strong operational experience supporting 24x7 production systems, including on-call responsibilities
- Strong technical communication and cross-team collaboration skills
- Experience operating in regulated or compliance-heavy environments (e.g., healthcare, finance)