Growe is a company focused on ensuring the availability, performance, and scalability of infrastructure and services. They are seeking a System Reliability Engineer/DevOps to lead incident response, manage infrastructure, and integrate security best practices while collaborating with various teams to enhance system reliability.

Responsibilities:

Ensure availability, performance, and scalability of infrastructure and services through monitoring, automation, and operational best practices
Lead incident response, perform root cause analysis, and implement recovery and long-term fixes
Manage infrastructure using Terraform, Terragrunt, and automation tools for consistency and repeatability
Implement and maintain metrics, logs, and tracing solutions (Prometheus, Grafana, Loki, VictoriaMetrics, CloudWatch) to ensure system visibility
Identify bottlenecks, tune systems, and improve infrastructure performance
Monitor resources, forecast growth, and implement scaling strategies
Integrate security best practices into IaC, CI/CD pipelines, and deployments
Support vulnerability management
Participate in the 24/7 on-call rotation (once a week) to ensure timely resolution of critical incidents
Work with DevOps, PRE, development, and security teams to improve reliability and design resilient systems
Maintain operational runbooks, incident reports, and system documentation

Requirements:

3+ years in a DevOps, SRE, or related role
Strong hands-on experience with AWS services including EC2, ECS, EKS, RDS, DocumentDB, ElastiCache, Keyspaces, S3, EBS, VPC, Route53, KMS, ACM, and CloudWatch
Proficiency with Terraform, Terragrunt, and Atlantis for reproducible and version-controlled infrastructure
Experience with GitLab CI, FluxCD, Argo Rollouts, and automation tools (Ansible, Python, Bash)
Solid experience with Docker, Kubernetes (AWS EKS), and Helm (including custom templates, ChartMuseum)
Familiarity with cluster add-ons such as KEDA, VPA, Karpenter, External-DNS, ingress-nginx, aws-alb-controller, and ebs-csi-driver
Hands-on experience with Grafana, VictoriaMetrics stack, Tempo, metrics exporters, Pingdom, AWS CloudWatch, and alerting systems like PagerDuty, VMAlert, and Alertmanager
Proficiency with Grafana Loki, OpenSearch, and Vector Agent for centralized logging
Strong understanding of networking concepts, AWS networking (VPC, Network Firewall, Transit Gateway, Site-to-Site VPN), identity and access management, certificate management (ACM, Vault, SOPS), and application security best practices
Familiarity with Cloudflare services, including caching, DNS, and Workers
Exposure to AWS Cost Explorer, KubeCost, and custom cost export tools
Certifications in AWS, Terraform, Kubernetes, or Helm are a plus
Problem-Solving Mindset: Approaches complex issues methodically and finds practical solutions under pressure
Analytical Thinking: Able to interpret metrics, logs, and system behavior to make informed decisions
Attention to Detail: Ensures accuracy in infrastructure changes, configurations, and deployment processes
Adaptability: Comfortable learning new tools and technologies and adjusting to changing environments
Collaboration & Teamwork: Works effectively with cross-functional teams and communicates clearly
Ownership & Responsibility: Takes accountability for tasks, incidents, and service reliability
Continuous Learning: Keeps up to date with DevOps, SRE, cloud, and security best practices
Effective Communication: Can explain technical concepts clearly to both technical and non-technical stakeholders
