Role Overview

Lead a team responsible for enterprise observability platforms and core development tooling, enabling fast detection, diagnosis, and resolution of production issues.
Own the reliability, security, scalability, performance, and instrumentation of CI/CD and developer platforms to improve system reliability and developer productivity at scale.
Partner with engineering, SRE/Operations, and Security to embed observability and operational excellence across the software delivery lifecycle.
Hire and coach the team, set priorities, and build a culture of reliability, ownership, learning, and continuous improvement.
Define vision/roadmap and standards for metrics, logs, traces, alerting, dashboards, and service health; promote early instrumentation and SLI/SLO-based practices.
Provide governance and strategic oversight for Azure DevOps, GitHub Enterprise, Jenkins, and related tooling; define guardrails for repos, branching, pipelines, build agents, and integrations.
Partner on incident response, RCA, and post-mortems; improve detection, triage, rollback, recovery, and on-call readiness.
Manage platforms via Infrastructure as Code; standardize configurations and operational practices; evaluate tools with an eye to capability, complexity, risk, and cost.
Ensure auditability, access controls/reviews, and logging meet enterprise requirements; optimize platform spend without degrading reliability or developer experience.
Convert technical information into business value and coordinate stakeholders across application, platform, and infrastructure teams.

Requirements

Experience leading engineering teams in Observability/SRE/Platform/DevOps Tools.
Hands-on background with observability platforms (e.g., Datadog, Splunk, New Relic, Grafana, OpenTelemetry).
Experience operating/governing CI/CD and developer platforms (Azure DevOps, GitHub Enterprise, Jenkins).
Strong understanding of cloud-native architectures (AWS and/or Azure).
Infrastructure as Code experience (e.g., Terraform, CloudFormation).
Strong leadership, communication, and organizational skills.

Tech Stack

AWS
Azure
Cloud
Grafana
Jenkins
Splunk
Terraform

Benefits

Health Benefits: Comprehensive, multi-carrier program for medical, dental and vision benefits
Retirement Benefits: 401(k) with match and an Employee Share Purchase Plan
Wellbeing: Wellness platform with incentives, Headspace app subscription, Employee Assistance and Time-off Programs
Short- and Long-Term Disability, Life and Accidental Death Insurance, Critical Illness, and Hospital Indemnity
Family Benefits, including bonding and family care leaves, adoption and surrogacy benefits
Health Savings, Health Care, Dependent Care and Commuter Spending Accounts
Up to two days of paid leave each to participate in Employee Resource Groups and to volunteer with your charity of choice

Observability & DevOps Tools Engineering Manager
