Ooma, Inc. empowers people to connect through its cloud-based communication platform. As a Site Reliability Engineer, you will ensure the stability and efficiency of our systems by applying expertise in Linux systems, containers, and CI/CD pipelines, while collaborating with teams to implement best practices for infrastructure management.
Responsibilities:
- Monitor and troubleshoot system performance, reliability, and availability issues using modern observability tools and techniques, with a strong emphasis on diagnosing and resolving issues in operating systems and bare metal environments
- Design, implement, and maintain scalable and reliable infrastructure using containers, Kubernetes, and microservices architecture
- Manage CI/CD pipelines to facilitate efficient software development and deployment processes
- Implement GitOps workflows using ArgoCD or Flux, and manage Helm charts and Kustomize configurations for declarative application deployment and version control
- Oversee configuration management to ensure consistent and reliable software releases across environments, using Ansible for system configuration, patch management, and provisioning across datacenter infrastructure
- Design and operate high-throughput Kafka clusters for event streaming, managing topics, partitions, replication, consumer lag monitoring, and disaster recovery strategies across datacenter infrastructure
- Collaborate with development teams to influence system design choices and operational policies
- Provide expert guidance on managing large data centers, including hundreds of bare metal servers and virtual machines (VMs), ensuring optimal configuration and performance
- Implement name services and server management practices to support our infrastructure needs
- Continuously evaluate and integrate new technologies to enhance operational efficiency and reliability
- Participate in on-call rotations to provide support for production systems as necessary, conduct blameless post-mortems with root cause analysis, and maintain incident response runbooks and procedures
- Create comprehensive technical documentation, runbooks, architectural diagrams, and network topology maps, and maintain knowledge bases for operational procedures and best practices
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field; advanced degree preferred
- 5+ years of experience as an SRE or in a related role, with a strong focus on production systems, containers, microservices, and service delivery
- Extensive experience managing and maintaining CI/CD pipelines and the tooling that supports them (GitOps workflows, ArgoCD, Helm charts)
- Comprehensive knowledge of observability tools such as Prometheus, the ELK Stack, log collectors, and Grafana for visualization
- Extensive on-premises experience managing large data centers with hundreds of bare metal servers and VMs
- Deep knowledge of Linux operating systems, their configuration, performance tuning, and troubleshooting
- Experience with configuration management tools
- Familiarity with networking concepts and protocols as they apply to Linux operating systems
- Proven ability to analyze complex systems, identify bottlenecks, and implement solutions with strong troubleshooting skills
- Excellent communication skills, with the ability to collaborate effectively with cross-functional teams
- Experience with containers and orchestration technologies, particularly Kubernetes, is a plus