Role Overview
We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).
This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.
As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.
Responsibilities
Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability
Site Reliability Engineering
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents
Kubernetes & Platform Reliability
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements
Secondary Responsibilities (Backup Application Support)
These responsibilities are activated as needed and are not part of day-to-day operations.
- Provide L2/L3 application support coverage during:
  - Support team resource shortages
  - High-severity incidents (SEVs)
  - Peak support periods or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)
Requirements
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating **Kubernetes** environments
- Experience supporting systems in **GCP** and on-prem environments
- Strong **Linux** systems and troubleshooting skills
- Fluent **English** (written and spoken)
- Ability to work in the **PST time zone**
- Ability to participate in an **on-call rotation** that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.
Technology Stack:
- Observability: Grafana, Prometheus, logging platforms
- Containers: Kubernetes (GKE and on-prem)
- Cloud: Google Cloud Platform (GCP)
- Operations: Linux, networking, infrastructure monitoring
- Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)
Nice to have:
- Experience supporting application teams during SEV incidents
- Knowledge of capacity planning and performance tuning
- Scripting skills (Python, Bash, etc.)
- Experience with hybrid infrastructure environments
Benefits
At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you’ll enjoy:
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy, as well as paid holidays
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.