Prove is a company focused on enhancing digital identity solutions through phone-centric identity tokenization. The company is seeking a Staff Site Reliability Engineer to lead reliability improvements, collaborate with engineering teams, and enforce standards in observability and automation.
Responsibilities:
- Guide infrastructure system design and application architecture within Platform Engineering and across engineering teams
- Develop and implement infrastructure-as-code standards using tools like Terraform and/or OpenTofu
- Act as a strong technical partner with our engineering teams
- Champion observability with product owners and engineering teams
- Improve new and existing systems by increasing reliability, performance, and scalability
- Automate routine operational tasks to reduce toil and improve efficiency
- Ensure infrastructure security compliance and implement least-privilege access controls
- Enhance existing CI/CD pipelines and feedback loops for maximum reliability
- Enable auto-scaling infrastructure based on custom metrics for applications and critical observability infrastructure
- Participate in a 24/7 on-call rotation
- Conduct thorough post-incident reviews and implement preventative measures
- Use observability data to perform root cause analysis and identify system improvements
Requirements:
- 8+ years of experience in Site Reliability Engineering, Platform Engineering, or an equivalent role
- 4+ years of experience in technical project leadership
- Deep understanding of cloud platforms, preferably AWS
- Expert knowledge of observability platforms and practices (OpenTelemetry, Prometheus, Grafana, Jaeger, ELK stack)
- Strong experience with Kubernetes and container orchestration
- Deep familiarity with service mesh technologies
- Experience with infrastructure-as-code tools (Terraform, Spacelift, OpenTofu)
- Proficiency in at least one programming language (Java, Go, Python)
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- Experience with distributed systems and microservice architectures
- Experience working in a high-compliance environment
- Hands-on experience instrumenting code with OpenTelemetry
- Application development experience