Bayer is a company driven to solve the world’s toughest challenges in health and agriculture. It is seeking a Senior Cloud Engineer specializing in observability to enhance its AWS platform, focusing on telemetry, monitoring, and reliability improvements.

Responsibilities:

Be the hands-on subject-matter expert (SME) for our observability toolchain (e.g., Datadog, CloudWatch, OpenSearch), including log pipelines, tracing/telemetry standards, and platform templates
Run office hours, produce exemplars, and pair with teams to implement 'known-good' instrumentation and alerting
Triage and resolve observability-related platform requests (new service onboarding, log/metric gaps, noisy alerts, dashboard standards) with clear ownership and measurable outcomes
Establish and operationalize SLIs/SLOs for key platform components and enable teams to define service SLOs without reinventing the wheel
Maintain opinionated 'golden paths' for:
  logging (standard fields/tags, retention, routing, searchability)
  metrics (naming conventions, cardinality guardrails, standard RED/USE views)
  tracing (service maps, critical spans, propagation standards)
  dashboards (starter dashboards by service type, plus curated views for platform reliability)
Provide reusable templates for alerting patterns (latency, error rate, saturation, dependency failures), tuned so that pages are actionable rather than noise
Reduce MTTR by improving detection, triage paths, runbooks, and 'what changed' visibility
Drive reliability reviews focused on observability gaps: missing signals, unclear ownership, bad alerts, and uninstrumented failure modes
Partner with delivery teams to turn recurring incidents into durable fixes (instrumentation + alerting + automation + documentation)
Embed observability checks into CI/CD and platform workflows (e.g., telemetry guardrails, dashboard/monitor templates, logging standards checks)
Partner with Security/Compliance to ensure telemetry supports auditability and incident investigation without ad-hoc effort
Define and report platform observability KPIs: alert noise rate, % actionable alerts, MTTA/MTTR trends, onboarding time to 'fully observable,' runbook coverage, incident recurrence
Run lightweight experiments to improve signal quality (threshold tuning, monitor redesign, dashboard UX), and ship improvements like a product owner
Create cost-aware telemetry standards (log volume controls, metric cardinality guidance, sampling strategies, retention tiers)
Help teams optimize spend while improving reliability outcomes ('cheaper + better' logging/metrics patterns)
Serve as a trusted partner to delivery units and to the Security and Data teams, turning pain points into paved-road improvements
Mentor engineers and uplift organizational practices for incident response, reliability signals, and operational excellence

Requirements:

Bachelor's degree in computer science/engineering or equivalent experience
5+ years hands-on AWS experience operating production workloads
Deep practical experience with observability in production, including:
  Datadog and/or CloudWatch (dashboards, monitors/alerts, log search, correlation)
  Designing actionable alerts (noise reduction, ownership, runbook-first alerts)
  Defining and using SLIs/SLOs and reliability metrics to drive behavior
Strong proficiency with Infrastructure as Code (Terraform; CloudFormation a plus)
Strong programming skills for automation and tooling (Python, Go, or similar)
Solid grasp of cloud architecture, networking, and security fundamentals
Experience productizing observability enablement (templates, golden paths, standards, onboarding workflows)
CI/CD at scale (GitLab pipelines), including integrating reliability/telemetry guardrails into delivery workflows
Experience with logging/telemetry platforms beyond CloudWatch/Datadog (e.g., ELK/OpenSearch), including managing scale concerns (volume, retention, cardinality)
Container platforms (ECS/EKS) and common AWS data services (RDS/Aurora, S3/lake patterns, MSK/Kinesis)
FinOps experience related to observability (tagging, allocation, optimizing telemetry cost)
Relevant AWS certifications
Excellent communication skills

Senior Cloud Engineer, Observability
