Own and architect cloud infrastructure across AWS and GCP, including EC2, EKS/Kubernetes, RDS, ElastiCache, S3, and networking components (VPCs, load balancers, DNS), driving improvements that increase reliability and reduce operational burden
Lead the design and implementation of secrets management strategies using HashiCorp Vault and other tools, establishing organizational standards for secure configuration management
Architect and evolve infrastructure-as-code practices using Terraform, driving adoption of patterns that improve consistency and reduce deployment risk
Design and optimize deployment pipelines and CI/CD systems, troubleshoot complex deployment failures with Git and FluxCD, and establish best practices for safe, reliable releases
Support database operations including migrations and performance tuning
Own Kafka clusters and message queue systems, including architecture decisions, capacity planning, and troubleshooting complex processing issues
Participate in 24/7 on-call rotations, responding to alerts, triaging incidents, and coordinating with development teams to resolve production issues
Design and implement monitoring, alerting, and observability strategies using Prometheus, Grafana, Sumo Logic, and related tools, establishing organizational standards that catch issues before customers notice them
Define and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, balancing business needs with engineering resources
Lead blameless post-mortems, write comprehensive incident analyses that teach others, and drive systemic improvements that prevent entire classes of incidents
Champion access controls, IAM policies, and security configurations across cloud environments, ensuring infrastructure meets compliance and security requirements
Identify and eliminate systemic sources of operational toil by designing automation, building self-service tooling, and improving developer workflows that scale the team's impact
Lead AI-assisted automation initiatives to streamline SRE processes, implementing solutions that reduce toil and improve incident response
Partner with product development teams as the reliability subject matter expert, providing architecture guidance, production readiness reviews, and proactive capacity planning
Mentor and coach SRE team members, helping them develop technical skills and operational judgment through pairing, code review, and incident response shadowing
Lead knowledge sharing initiatives, demos, and cross-team collaboration to elevate reliability culture and operational excellence across the engineering organization
Requirements
5+ years of experience with cloud platforms (AWS or GCP) and container orchestration systems (Kubernetes/Docker)
Experience with cloud networking concepts and services including VPCs, subnets, security groups, NAT gateways, VPC peering, load balancers, and DNS management (Route 53, Cloud DNS)
Strong programming skills in one or more languages (Python, Go, Bash), with a demonstrated ability to build automation and tooling
Advanced experience with Infrastructure as Code tools (Terraform, Helm, Ansible) including module design and organizational standards
Deep understanding of Linux systems administration and networking fundamentals (TCP/IP, DNS, load balancing, distributed systems)
Experience with SQL databases (PostgreSQL, MySQL, or SQL Server) including performance tuning and capacity planning
Experience designing and operating CI/CD pipelines for reliable software delivery
Track record of leading incident response and driving complex issues to resolution
Demonstrated ability to mentor engineers and contribute to team technical growth
Excellent collaboration and communication skills, with the ability to influence technical decisions across teams
Tech Stack
Ansible
AWS
Cloud
Distributed Systems
DNS
Docker
EC2
Google Cloud Platform
Grafana
Kafka
Kubernetes
Linux
MySQL
Postgres
Prometheus
Python
SQL
TCP/IP
Terraform
Vault
Go
Benefits
All interviews are being held remotely
If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.