Bayer is a company driven to solve the world’s toughest challenges in health and agriculture. They are seeking a Senior Cloud Engineer specializing in observability to enhance their AWS platform, focusing on telemetry, monitoring, and reliability improvements.
Responsibilities:
- Be the hands-on SME for our observability toolchain (e.g., Datadog, CloudWatch, OpenSearch), including log pipelines, tracing/telemetry standards, and platform templates
- Run office hours, produce exemplars, and pair with teams to implement 'known-good' instrumentation and alerting
- Triage and resolve observability-related platform requests (new service onboarding, log/metric gaps, noisy alerts, dashboard standards) with clear ownership and measurable outcomes
- Establish and operationalize SLIs/SLOs for key platform components and enable teams to define service SLOs without reinventing the wheel
- Maintain opinionated 'golden paths' for:
  - Logging: standard fields/tags, retention, routing, searchability
  - Metrics: naming conventions, cardinality guardrails, standard RED/USE views
  - Tracing: service maps, critical spans, propagation standards
  - Dashboards: starter dashboards by service type, plus curated views for platform reliability
- Provide reusable templates for alerting patterns (latency, error-rate, saturation, dependency failures), tuned for actionable paging vs. noise
- Reduce MTTR by improving detection, triage paths, runbooks, and 'what changed' visibility
- Drive reliability reviews focused on observability gaps: missing signals, unclear ownership, bad alerts, and uninstrumented failure modes
- Partner with delivery teams to turn recurring incidents into durable fixes (instrumentation + alerting + automation + documentation)
- Embed observability checks into CI/CD and platform workflows (e.g., telemetry guardrails, dashboard/monitor templates, logging standards checks)
- Partner with Security/Compliance to ensure telemetry supports auditability and incident investigation without ad-hoc effort
- Define and report platform observability KPIs: alert noise rate, % actionable alerts, MTTA/MTTR trends, onboarding time to 'fully observable,' runbook coverage, incident recurrence
- Run lightweight experiments to improve signal quality (threshold tuning, monitor redesign, dashboard UX), and ship improvements like a product owner
- Create cost-aware telemetry standards (log volume controls, metric cardinality guidance, sampling strategies, retention tiers)
- Help teams optimize spend while improving reliability outcomes ('cheaper + better' logging/metrics patterns)
- Serve as a trusted partner to delivery units, Security, and Data—turning pain points into paved-road improvements
- Mentor engineers and uplift organizational practices for incident response, reliability signals, and operational excellence
Requirements:
- Bachelor's degree in computer science or engineering, or equivalent experience
- 5+ years hands-on AWS experience operating production workloads
- Deep practical experience with observability in production, including:
  - Datadog and/or CloudWatch (dashboards, monitors/alerts, log search, correlation)
  - Designing actionable alerts (noise reduction, ownership, runbook-first alerts)
  - Defining/using SLIs/SLOs and reliability metrics to drive behavior
- Strong proficiency with Infrastructure as Code (Terraform; CloudFormation a plus)
- Strong programming for automation/tooling (Python, Go, or similar)
- Solid grasp of cloud architecture, networking, and security fundamentals
- Experience productizing observability enablement (templates, golden paths, standards, onboarding workflows)
- CI/CD at scale (GitLab pipelines), including integrating reliability/telemetry guardrails into delivery workflows
- Logging/telemetry platforms beyond CloudWatch/Datadog (e.g., ELK/OpenSearch) and experience managing scale concerns (volume, retention, cardinality)
- Container platforms (ECS/EKS) and common AWS data services (RDS/Aurora, S3/lake patterns, MSK/Kinesis)
- FinOps experience related to observability (tagging, allocation, optimizing telemetry cost)
- Relevant AWS certifications and excellent communication skills