Placer.ai is transforming how organizations understand the physical world through its location analytics platform. The company is seeking a Data Platform Engineer to own and scale the Kubernetes infrastructure behind its large-scale data processing platform, with a focus on making distributed data workloads reliable, cost-efficient, and performant at scale.
Responsibilities:
- Operate and scale Kubernetes clusters with thousands of nodes supporting large-scale Spark and data processing workloads
- Manage and optimize Apache Spark on Kubernetes — executor autoscaling, driver scheduling, resource tuning, spot instance strategies (see the configuration sketch after this list)
- Deploy and tune remote shuffle services (e.g., Apache Celeborn) to handle shuffle data at scale across multiple availability zones
- Operate and improve self-hosted Apache Airflow infrastructure on Kubernetes
- Configure and optimize batch schedulers (e.g., YuniKorn, Volcano) for gang scheduling, fair-share queuing, and resource prioritization
- Drive cost optimization across large compute fleets — spot vs. on-demand strategies, node right-sizing, autoscaling policies, local SSD utilization
- Partner with Data Engineering teams on workload performance, resource allocation, and infrastructure requirements
- Manage infrastructure-as-code (Terraform) and GitOps deployments (ArgoCD, Helm) for data platform services
- Integrate with managed data platforms (e.g., Databricks) and cloud storage for hybrid processing architectures
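To give a concrete sense of the Spark-on-Kubernetes work described above, here is a minimal, hypothetical configuration sketch showing dynamic executor allocation, spot-node placement, YuniKorn queue scheduling, and Celeborn remote shuffle wired together through Spark properties. The queue name, node label, and Celeborn endpoint are illustrative assumptions, not Placer.ai's actual settings.

```python
# Minimal sketch of the Spark-on-Kubernetes properties this role tunes.
# Values (queue name, node label, Celeborn endpoint) are illustrative only.

spark_conf = {
    # Dynamic allocation: grow and shrink the executor fleet with load.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "200",

    # Place executors on spot capacity (GKE spot node label shown here).
    "spark.kubernetes.executor.node.selector.cloud.google.com/gke-spot": "true",

    # Delegate pod scheduling to YuniKorn for gang scheduling and
    # fair-share queues; the queue path is an assumption.
    "spark.kubernetes.scheduler.name": "yunikorn",
    "spark.kubernetes.driver.annotation.yunikorn.apache.org/queue": "root.spark",

    # Push shuffle data to Apache Celeborn so executors stay stateless
    # and spot reclamation does not destroy shuffle output.
    "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    "spark.celeborn.master.endpoints": "celeborn-master-0.celeborn:9097",
}

# Render the properties as spark-submit flags.
flags = " ".join(f"--conf {key}={value}" for key, value in spark_conf.items())
print(f"spark-submit {flags} --class com.example.Job job.jar")
```

Routing shuffle through a remote service like Celeborn is what makes aggressive spot usage viable: executors hold no shuffle state, so losing a reclaimed node does not force recomputation of completed map output.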
Requirements:
- 3+ years of experience operating Kubernetes in production at significant scale (hundreds to thousands of nodes)
- Hands-on experience with Apache Spark on Kubernetes — you understand executors, drivers, dynamic allocation, shuffle behavior, and how they map to K8s primitives
- Strong understanding of Kubernetes internals — scheduling, resource management, node autoscaling, pod lifecycle, taints/tolerations, local storage
- Experience with cloud infrastructure (GCP preferred) — managed Kubernetes, spot/preemptible instances, local SSDs, networking at scale
- Comfortable with infrastructure-as-code (Terraform) and GitOps workflows
- Proficiency in Python or Go
- Experience operating Apache Airflow at scale on Kubernetes
- Experience with Apache Celeborn or similar remote shuffle services
- Familiarity with YuniKorn or Volcano batch schedulers
- Experience with Databricks administration and integration
- Knowledge of data formats and storage systems (Parquet, Delta Lake, cloud object storage)
- Experience with streaming or messaging systems (Kafka)
- Experience with Prometheus/Grafana observability stacks for data platform monitoring
- Contributions to open-source data infrastructure projects