Planet is a company that designs, builds, and operates the largest constellation of imaging satellites in history. The Site Reliability Engineer will be responsible for building, deploying, and operating critical compute software that supports end-to-end imaging operations within customer environments, ensuring the reliability, scalability, and availability of services.

Responsibilities:

Build and deploy computing services and infrastructure in customer environments for a next-generation, end-to-end satellite operations and image processing platform
Operate in a high-impact, tight-knit team to architect novel systems for air-gapped deployments at scale
Clarify and surface requirements from ambiguous use cases defined by cross-functional stakeholders, including internal users and external customers
Own operations such as deployments, service orchestration, and documentation for cross-platform stakeholders
Scale architecture while ensuring availability of services
Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests
Participate in on-call rotations to ensure operational excellence

Requirements:

Bachelor's degree in Computer Science or similar
10+ years of experience building services that leverage cloud-native infrastructure and tooling
Experience deploying and maintaining bare-metal and cloud Kubernetes through tools such as Talos, RKE2, Proxmox, or k3s
Proficiency with Terraform, Ansible, Helm, Kustomize, and/or similar IaC / GitOps tooling
Experience successfully building, releasing, and supporting highly available, consistently performant services
Knowledge of hardware- and network-level implications of on-prem compute
Experience with platform optimization, particularly resource optimization, management, and cluster tuning in a constrained environment
Ability to observe and troubleshoot distributed systems with tools such as Alloy, Prometheus, Grafana, and OpenTelemetry
Advanced skills in Python, Bash, and other tooling as appropriate to build services and meet product goals
Excellent communication skills and the ability to collaborate with cross-functional engineering teams
Experience working with Jira for task management and progress tracking
Experience with CUDA-based GPU programs
Security expertise in sensitive environments, including implementing zero-trust architectures, hardening Kubernetes clusters, conducting security audits, and deploying workloads in air-gapped environments
