Role Overview

Lead and participate in the design, implementation, and maintenance of highly available and scalable infrastructure.
Monitor system health and performance metrics, and plan capacity to ensure optimal performance.
Establish and track SLIs, SLOs, and error budgets to measure and improve system reliability.
Design and implement Infrastructure as Code (IaC) solutions using tools like Terraform, Pulumi, or CloudFormation.
Build and maintain CI/CD pipelines to enable rapid, safe deployments.
Automate operational tasks and eliminate toil through scripting and tooling.
Lead incident response efforts, including on-call rotation, post-mortem analysis, and remediation.
Debug and resolve complex production issues across the entire stack.
Implement monitoring, alerting, and observability solutions to detect and prevent issues proactively.
Provide technical leadership and mentorship to engineers on reliability and infrastructure best practices.
Collaborate with cross-functional teams, including Engineering and Product, to ensure reliable product delivery.
Lead the technical design of infrastructure solutions, ensuring alignment with architectural principles and business goals.
Stay up to date with emerging technologies and industry trends in SRE, DevOps, and cloud infrastructure.
Propose and drive the adoption of best practices, tools, and processes to enhance system reliability and developer productivity.
Conduct chaos engineering experiments and disaster recovery drills to validate system resilience.
Implement and maintain security best practices across infrastructure and applications.
Manage secrets, access controls, and security monitoring systems.
Foster a collaborative environment within the engineering team and across departments.
Clearly communicate technical concepts and system health to both technical and non-technical stakeholders.
Work closely with engineering teams to define reliability requirements and ensure operational excellence.

Requirements

5+ years (ideally 7+) of relevant work experience in Site Reliability Engineering, DevOps, or Infrastructure roles
1+ years of hands-on scripting experience in Python, Go, or Bash
Experience with cloud platforms (ideally GCP) and container orchestration (Kubernetes, Docker)
Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, or similar)
Strong understanding of Linux systems, networking, and distributed systems
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or similar)
Excellent problem-solving and communication skills
Able to work independently and as part of a team

Tech Stack

Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Kubernetes
Linux
Prometheus
Python
Terraform
Go

Benefits

Be part of a mission-driven, rapidly scaling company changing the future of eye care
Work remotely from anywhere in the U.S.
Collaborate with a passionate, fun, and supportive team
Competitive salary: $150,000 to $200,000
Equity in a fast-growing startup
Health, vision, and dental benefits
Unlimited PTO
Annual professional development stipend
A high-impact role with plenty of room for growth, ownership, and creativity

Senior Site Reliability Engineer
