Plume is a technology company that has built a service delivery platform for smart homes and businesses. The company is seeking an experienced Site Reliability Engineer to develop and implement tools and processes that keep its systems stable and reliable, working in close collaboration with product engineers.
Responsibilities:
- Implement, manage, and maintain scalable and reliable infrastructure using infrastructure-as-code (IaC) tools
- Develop and implement observability solutions to help engineers ensure high availability and performance of all services
- Design, build, and maintain CI/CD pipelines to streamline the deployment process
- Collaborate with development teams to ensure services are designed with operability and reliability in mind
- Participate in a global on-call rotation to provide support for critical production systems
- Drive down operational toil through automation and process improvements
- Manage and optimize cloud resources for cost efficiency and performance
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- 5+ years of experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role
- Expertise in one or more programming languages (e.g., Python, Go)
- Experience with one or more cloud computing platforms (e.g., AWS, GCP)
- Proficiency with configuration management and IaC tools (e.g., Terraform, Salt)
- Proficiency with Kubernetes-based environments
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, OpenSearch)
- Excellent technical communication skills
- Experience with large-scale distributed systems
- Knowledge of networking and security best practices
- Familiarity with database technologies (SQL and NoSQL)