Support and extend dashboards, metrics tracking, and APM tracing infrastructure inside Datadog and Sumo Logic
Maintain multi-tenant workspaces and universal tagging compliance across teams
Configure and manage PagerDuty infrastructure
Maintain service orchestrations, alert routing rules, event intelligence settings, on-call schedules, and native alert integrations with collaboration platforms (Slack)
Optimize telemetry pipeline data flows using Cribl to eliminate noise, drop duplicate fields, and strip out bloated payloads
Ensure high-value signals reach Sumo Logic and Datadog while directing low-value compliance logs to archival cold storage
Fully automate deployment, onboarding, and patch management of monitoring agents and pipeline configurations, and keep their state consistent, using Ansible playbooks and roles
Enforce telemetry schemas, log signatures, and operational golden signals across the enterprise
Serve as an engineering mentor across internal product teams, building out technical documentation and runbooks and leading enablement sessions on modern logging and alerting procedures
Requirements
3 years of Python development experience
Proven expertise in Datadog, including AWS integrations and dashboard templating
Experience with SignalFx/Splunk Observability Cloud and legacy monitoring paradigms
Experience working across Infra, App, and DevOps teams to create relevant metrics
Experience applying Site Reliability Engineering (SRE) concepts
Strong understanding of AWS architecture and cloud-native observability
Strong understanding of monitoring distributed systems
Familiarity with OpenShift or Kubernetes
Familiarity with Ansible
Familiarity with Infrastructure-as-Code concepts
Familiarity with OpenTelemetry
Excellent communication and stakeholder management skills
Certifications in Datadog, AWS, or related observability platforms (preferred)
Experience in enterprise-scale monitoring transformations (preferred)