Ensure the reliability and performance of our data platform services (Trino, Iceberg, S3, Kafka, Flink)
Define and implement SRE best practices: SLIs/SLOs, error budgets, and observability
Build and maintain monitoring, alerting, and incident response frameworks (Prometheus, Grafana, etc.)
Contribute to the migration from a public cloud data warehouse to VeepeeCloud’s lakehouse stack
Support the coexistence of cloud and on-prem systems while ensuring data consistency and service reliability
Help design resilient architectures for ingestion, transformation, and serving layers
Operate and improve services running on Kubernetes (GKE/EKS and on-prem clusters)
Automate infrastructure provisioning using Terraform, Atlantis, and/or Crossplane
Improve GitOps workflows for platform deployment and configuration
Collaborate with teams to optimize compute and storage usage (Trino queries, BigQuery slots, etc.)
Build tools and dashboards to track cost, usage, and efficiency
Support the transition toward cost-efficient on-prem workloads
Improve self-service capabilities for data teams (e.g., provisioning Trino/Iceberg resources)
Help teams adopt best practices in reliability, observability, and deployment
Write clear technical documentation and runbooks
Contribute to the definition and implementation of the Disaster Recovery Plan (DRP)
Ensure multi-DC resilience (FR1 / NL1) and implement data replication strategies
Participate in incident management and postmortems

Site Reliability Engineer – Data Platform

Key skills