Design and implement robust, scalable, and high-availability systems that meet business and technical requirements.
Collaborate with software engineering teams to integrate reliability into the software development lifecycle, ensuring that applications are built with operational excellence in mind.
Develop and maintain service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs) to measure system performance and reliability.
Lead incident response efforts, including post-mortem analysis and root cause investigations, to improve system reliability and prevent future incidents.
Automate operational processes to improve efficiency and reduce manual intervention, leveraging tools and technologies such as Infrastructure as Code (IaC).
Monitor system performance and reliability using appropriate metrics and monitoring tools, proactively identifying and addressing potential issues.
Advocate for and implement best practices in site reliability engineering, including capacity planning, disaster recovery, and incident management.
Train and mentor engineering and operations teams on SRE principles and practices, fostering a culture of continuous improvement.
Requirements
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
8+ years of experience in software engineering, systems engineering, or site reliability engineering.
Strong understanding of cloud computing platforms (e.g., AWS, Azure, Google Cloud) and container and orchestration technologies (e.g., Docker, Kubernetes).
Experience with Infrastructure as Code (IaC), configuration management, and automation tools (e.g., Terraform, Ansible, Puppet).
Proficiency in programming and scripting languages (e.g., Python, Go, Bash) for automation and tool development.
Extensive knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) and practices.
Solid understanding of networking concepts, distributed systems, and microservices architecture.
Excellent problem-solving skills and the ability to work effectively under pressure.
Tech Stack
Ansible
AWS
Azure
Cloud
Distributed Systems
Docker
Grafana
Kubernetes
Microservices
Prometheus
Puppet
Python
Terraform
Go
Benefits
Unlimited PTO
Paid Holidays
Onsite Fitness Center
Company Paid Life Insurance
Casual Dress Code
Competitive Pay
Health, Vision, and Dental Insurance
401(k) match: Pattern matches 100% of the first 3% of eligible compensation deferred and 50% of the next 2%, for a maximum match of 4% of eligible compensation.