Prove is a company focused on enhancing digital identity solutions through phone-centric identity tokenization. They are seeking a Staff Site Reliability Engineer to lead efforts in improving reliability, collaborating with engineering teams, and enforcing standards in observability and automation.

Responsibilities:

Guide infrastructure system design and application architecture within Platform Engineering and across engineering teams
Develop and implement infrastructure-as-code standards using tools like Terraform and/or OpenTofu
Act as a strong technical partner with our engineering teams
Champion observability with product owners and engineering teams
Improve new and existing systems by increasing reliability, performance, and scalability
Automate routine operational tasks to reduce toil and improve efficiency
Ensure infrastructure security compliance and implement least-privilege access controls
Enhance existing CI/CD pipelines and feedback loops for maximum reliability
Enable auto-scaling infrastructure based on custom metrics for applications and critical observability infrastructure
Participate in a 24/7 on-call rotation
Conduct thorough post-incident reviews and implement preventative measures
Use observability data to perform root cause analysis and identify system improvements

Requirements:

8+ years of experience in Site Reliability Engineering, Platform Engineering, or an equivalent role
4+ years of experience in technical project leadership
Deep understanding of cloud platforms, preferably AWS
Expert knowledge of observability platforms and practices (OpenTelemetry, Prometheus, Grafana, Jaeger, ELK stack)
Strong experience with Kubernetes and container orchestration
Deep familiarity with service mesh technologies
Experience with infrastructure-as-code tools (Terraform, Spacelift, OpenTofu)
Proficiency in at least one programming language (Java, Go, Python)
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Experience with distributed systems and microservice architectures
Experience working in a high-compliance environment
Hands-on experience instrumenting code with OpenTelemetry
Application development experience
