Vantage Data Centers powers and connects technology for major hyperscalers and cloud providers, focusing on innovative data center solutions. The company is seeking a Lead Observability Engineer to design and maintain its observability platform, ensuring visibility into system health and performance across its data centers. The role involves building ingestion pipelines, defining metrics, and collaborating with teams across the organization to improve operational outcomes.
Responsibilities:
- Design and operate a scalable observability platform with a primary focus on the Elastic Stack (Elasticsearch, Logstash, Kibana)
- Build and maintain log ingestion and enrichment pipelines (Logstash) including parsing, normalization, and routing standards
- Create, curate, and govern Kibana assets (dashboards, visualizations, Lens, Discover views) that support operations and engineering use cases
- Define and implement metrics and alerting standards (SLIs/SLOs, thresholds, burn-rate alerts) to improve detection and reduce MTTR
- Develop observability metrics for Operational Technology (OT) environments (e.g., BMS/EPMS/SCADA and other OT telemetry) to track availability, performance, alarms, and operational KPIs
- Partner across teams to instrument services and infrastructure, troubleshoot incidents using telemetry, and continuously improve reliability, performance, and cost
- Engineer and operate Elasticsearch clusters (or Elastic Cloud) including sizing, scaling, sharding/ILM, retention, backup/restore, and performance tuning
- Develop and maintain Logstash pipelines (inputs/filters/outputs) to ingest logs/metrics from servers, network devices, virtualization platforms, containers, and cloud services
- Create telemetry standards: field naming conventions, ECS alignment where appropriate, parsing/grok patterns, enrichment lookups, and data quality checks
- Build and maintain Kibana dashboards, visualizations, and alerting rules; publish curated views for NOC/operations and engineering teams
- Create and operationalize metrics: define SLIs/SLOs, implement metric collection/export, and ensure actionable alerting with runbooks and escalation paths
- Partner with facilities/critical infrastructure teams to define OT-focused SLIs/SLOs and metrics (e.g., alarm rates, sensor health, control loop status, device/point availability), normalize and tag OT telemetry, and build dashboards/alerts that support 24x7 operations
- Automate configuration and deployment of observability components using infrastructure-as-code and configuration management (e.g., Terraform, Ansible) and CI/CD pipelines
- Implement security best practices for telemetry platforms including role-based access control, data handling/PII controls, encryption, and auditability
- Participate in incident response and post-incident reviews; use logs and metrics to identify root cause, document findings, and drive preventive improvements
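Several of the responsibilities above center on SLO-based alerting (SLIs/SLOs, thresholds, burn-rate alerts). As a hedged illustration of what a multiwindow burn-rate check of that kind might look like (all function names, thresholds, and window ratios here are illustrative assumptions, not Vantage's actual standards):

```python
# Sketch of a multiwindow, multi-burn-rate SLO alert check, as commonly
# described in SRE practice. Names and thresholds are illustrative
# assumptions, not the employer's actual alerting standards.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to plan.

    error_ratio: fraction of failed requests in the window (0.0-1.0)
    slo_target:  availability objective, e.g. 0.999
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn fast.

    Requiring the short window to confirm the burn is still happening
    suppresses pages for transient spikes, which reduces alert fatigue.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold and
            burn_rate(short_window_ratio, slo_target) >= threshold)

# A 1.5% error ratio against a 99.9% SLO is a 15x burn rate, so a
# sustained burn pages, while a recovered short window does not.
print(should_page(0.015, 0.020))   # True
print(should_page(0.015, 0.0005))  # False
```

The two-window pattern is the usual answer to the "reduce MTTR without paging on noise" tension the bullets describe: the long window gives significance, the short window gives recency.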
Requirements:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or equivalent practical experience
- 7+ years of experience in observability, SRE, platform, or DevOps engineering roles, with demonstrated ownership of production monitoring/logging systems
- Deep, hands-on expertise with Elastic Stack: Elasticsearch (cluster operations and tuning), Logstash (pipeline engineering), and Kibana (dashboards and alerting)
- Strong experience creating and operationalizing metrics (collection, aggregation, cardinality management, alerting strategy) and defining SLIs/SLOs
- Proficiency in at least one scripting/programming language (Python, Ruby, Go, or PowerShell) for automation, data parsing, and platform tooling
- Strong knowledge of log and event data modeling, parsing (grok/regex), enrichment, and schema management (e.g., ECS), including troubleshooting ingestion issues end-to-end
- Experience with query languages and analysis techniques (KQL/Lucene, Elasticsearch DSL), and ability to build actionable visualizations and detections from telemetry
- Hands-on experience with index lifecycle management (ILM), data streams, retention policies, and capacity planning for high-volume telemetry workloads
- Experience with infrastructure-as-code and automation (e.g., Terraform, Ansible) and CI/CD practices to deploy and manage observability components
- Solid understanding of Linux systems, networking fundamentals, and distributed systems concepts as they relate to telemetry, performance, and troubleshooting
- Experience integrating telemetry from hybrid environments (data center infrastructure, virtualization, containers/Kubernetes, and cloud services)
- Familiarity with Operational Technology (OT) / Industrial Control Systems (ICS) observability concepts and common BMS/EPMS applications such as Inductive Automation Ignition
- Working knowledge of complementary observability tooling (e.g., Beats/Elastic Agent, Prometheus, Grafana, OpenTelemetry) and how to integrate telemetry across systems in line with event management practices
- Experience operating services with on-call practices, incident management, and post-incident review processes; ability to write clear runbooks
- Strong understanding of reliability engineering concepts including observability design, alert fatigue reduction, and measuring user/system impact
- Experience working within ITIL or similar operational practices (change, incident, problem management)
- Familiarity with regulatory/compliance expectations and secure handling of operational data (e.g., GDPR, PCI, SOX) as applicable
- Excellent written and verbal communication skills; ability to translate telemetry needs into standards, dashboards, and alerts that teams adopt
- Ability to work independently and collaboratively across infrastructure, network, security, and application teams
- Data center industry experience is strongly preferred but not required
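Several requirements above reference grok/regex parsing, ECS-aligned field naming, and data quality checks. As a hedged sketch of that normalization step (the pattern, sample log line, and field mapping are assumptions for illustration, not Vantage's actual schema):

```python
import re

# Grok-style parsing of a syslog-like line into ECS-aligned dotted
# field names. Pattern, sample line, and mapping are illustrative
# assumptions, not an actual production schema.
LINE = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) (?P<proc>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<msg>.*)"
)

def to_ecs(raw: str) -> dict:
    """Map a raw log line onto ECS-style field names."""
    m = LINE.match(raw)
    if m is None:
        # Tag unparseable events instead of dropping them, so data
        # quality problems stay visible downstream (a common Logstash
        # convention is a _grokparsefailure-style tag).
        return {"event.original": raw, "tags": ["_parse_failure"]}
    return {
        "@timestamp": m["ts"],
        "host.name": m["host"],
        "process.name": m["proc"],
        "process.pid": int(m["pid"]),
        "message": m["msg"],
        "event.original": raw,
    }

doc = to_ecs("2024-05-01T12:00:00Z edge01 sshd[4210]: Accepted publickey for ops")
print(doc["host.name"], doc["process.pid"])  # edge01 4210
```

Keeping `event.original` alongside the parsed fields mirrors the ECS convention of preserving the raw event for reprocessing when parsing rules change.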