Ensure the reliability, availability, and scalability of the systems and services in your assigned Product Areas (PAs).
Develop and implement monitoring, observability, and alerting solutions integrated with the Agentic Engineering Platform.
Support teams in defining and tracking SLIs, SLOs, and error budgets.
Design and evolve on-call management across Product Areas: rotations, escalation, alerting tools, and incident management.
Work closely with the Engineering Platform to ensure platform capabilities reach product teams and are adopted by them.
Actively contribute to the evolution of the Agentic Engineering Platform by bringing concrete feedback from Product Areas on friction points, gaps, and improvement opportunities.
Help build and shape a reliability-oriented (SRE) engineering culture across the company.
Support migrations of critical systems, environment segregation, and the deprecation of legacy technologies.
Requirements
Experience with cloud environments, preferably Google Cloud Platform (GCP).
Proficiency with observability tools and practices (Prometheus, Grafana, Loki, Thanos, Elasticsearch, Alertmanager, etc.).
Strong knowledge of Kubernetes and distributed systems architecture.
Strong knowledge of Infrastructure as Code (IaC) and Terraform.
Hands-on experience with incident management, on-call processes, and post-mortems.
Experience defining and tracking SLOs and error budgets.
Ability to analyze logs and the performance of distributed systems.
Strong communication and influencing skills, with the ability to advocate technical solutions to diverse audiences, including engineers, PMs, and leadership.
Data-driven mindset, using metrics to map risks, prioritize actions, and demonstrate impact.