Leads a team responsible for enterprise observability platforms and core development tooling, enabling fast detection, diagnosis, and resolution of production issues.
Owns the reliability, security, scalability, performance, and instrumentation of CI/CD and developer platforms to improve system reliability and developer productivity at scale.
Partners with engineering, SRE/Operations, and Security to embed observability and operational excellence across the software delivery lifecycle.
Hire and coach engineers, set team priorities, and build a culture of reliability, ownership, learning, and continuous improvement.
Define the vision, roadmap, and standards for metrics, logs, traces, alerting, dashboards, and service health; promote early instrumentation and SLI/SLO-based practices.
Provide governance and strategic oversight for Azure DevOps, GitHub Enterprise, Jenkins, and related tooling; define guardrails for repos, branching, pipelines, build agents, and integrations.
Partner on incident response, RCA, and post-mortems; improve detection, triage, rollback, recovery, and on-call readiness.
Manage platforms via Infrastructure as Code; standardize configurations and operational practices; evaluate tools with an eye to capability, complexity, risk, and cost.
Ensure auditability, access controls/reviews, and logging meet enterprise requirements; optimize platform spend without degrading reliability or developer experience.
Translate technical information into business value for stakeholders, and coordinate across application, platform, and infrastructure teams.
Requirements
Experience leading engineering teams in Observability/SRE/Platform/DevOps Tools.
Hands-on background with observability platforms (e.g., Datadog, Splunk, New Relic, Grafana, OpenTelemetry).