Collaborate deeply with our infrastructure and product teams to enforce org-wide practices for emitting and collecting telemetry across a wide range of internal- and external-facing services.
Own and operate the Kubernetes infrastructure of the Observability team.
Work within the Observability team to ensure industry-standard deployment and reliability practices are used.
Orchestrate and scale systems such as VictoriaMetrics, OpenTelemetry Collector, and Vector.
Requirements
5+ years of experience in a Site Reliability Engineering role
Experience operating and supporting clustered applications in production environments
Hands-on experience deploying and managing applications in Kubernetes (k8s) environments
Working knowledge of PostgreSQL, including administration, performance tuning, and troubleshooting
Proficiency with at least one Infrastructure as Code (IaC) tool (e.g., Terraform, Pulumi, OpenTofu, or equivalent)
Experience with telemetry tooling such as OpenTelemetry, VictoriaMetrics, Grafana, and Prometheus
Experience with AWS services is a plus
Strong documentation and communication skills are a plus