Planet is a company that designs, builds, and operates the largest constellation of imaging satellites in history. The Site Reliability Engineer will be responsible for building, deploying, and operating critical compute software that supports end-to-end imaging operations within customer environments, ensuring the reliability, scalability, and availability of services.
Responsibilities:
- Build and deploy computing services and infrastructure in customer environments for a next-generation, end-to-end satellite operations and image processing platform
- Operate in a high-impact, tight-knit team to architect novel systems for air-gapped deployments at scale
- Clarify and surface requirements from ambiguous use cases defined by cross-functional stakeholders, including internal users and external customers
- Own operations such as deployments, service orchestration, and documentation for cross-platform stakeholders
- Scale architecture while ensuring availability of services
- Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests
- Participate in on-call rotations to ensure operational excellence
Requirements:
- Bachelor's degree in Computer Science or a similar field
- 10+ years of experience building services that leverage cloud-native infrastructure and tooling
- Experience deploying and maintaining bare-metal and cloud Kubernetes through tools such as Talos, RKE2, Proxmox, or k3s
- Proficiency with Terraform, Ansible, Helm, Kustomize, and/or similar IaC / GitOps tooling
- Experience successfully building, releasing, and supporting highly available, consistently performant services
- Knowledge of hardware- and network-level implications of on-prem compute
- Experience with platform optimization, particularly resource optimization, management, and cluster tuning in a constrained environment
- Ability to observe and troubleshoot distributed systems with tools such as Alloy, Prometheus, Grafana, and OpenTelemetry
- Advanced skills in Python, Bash, and other tooling as appropriate to build services and meet product goals
- Excellent communication skills and the ability to collaborate with cross-functional engineering teams
- Experience working with Jira for task management and progress tracking
- Experience with CUDA-based GPU programs
- Security expertise in sensitive environments, including implementing zero-trust architectures, hardening Kubernetes clusters, conducting security audits, and deploying workloads in air-gapped environments