Role Overview

Monitor Production Systems: Use monitoring tools (e.g., Cloud Monitoring) to ensure the health and performance of cloud-based production systems on Google Cloud Platform (GCP).
Incident Management: Respond to production incidents, triage issues, and ensure timely resolution. Perform root cause analysis (RCA) and document findings.
Performance Tuning: Analyze system performance, identify bottlenecks, and make recommendations for improvements to optimize service reliability, scalability, and speed.
System Alerts and Incident Escalation: Set up and maintain system alerts to proactively detect issues. Escalate critical issues to appropriate teams and ensure swift resolution.
Collaboration with Engineering: Work closely with development and operations teams to ensure smooth production releases, provide feedback on system performance, and implement monitoring solutions for new services.
System Documentation: Maintain documentation of system configurations, monitoring setups, and incident resolutions to support knowledge sharing across teams.
Service Level Agreements (SLAs): Track and report on SLA performance, ensuring that production services meet predefined availability and reliability standards.
Proactive System Health Checks: Conduct routine system health checks, reviewing logs and performance metrics, to ensure system uptime.
Disaster Recovery and Backup: Monitor backup systems and ensure that disaster recovery procedures are in place and tested.

Requirements

3+ years of experience in cloud production support, Site Reliability Engineering, or System Reliability roles
3+ years of hands-on experience with Google Cloud Platform (GCP), including Compute Engine, GKE, Cloud Monitoring, Logging, and Storage
3+ years of experience using monitoring and observability tools to track system health and performance
3+ years of experience analyzing system performance metrics (CPU, memory, disk, network) and diagnosing issues
3+ years of experience managing incidents and troubleshooting live production systems
3+ years of experience in scripting or automation using Bash, Python, or similar languages
Strong experience with VoIP and UC technologies including SIP, RTP/SRTP, WebRTC, SBCs (Ribbon, Oracle, AudioCodes), SIP trunks, gateways, and voice codecs (G.711, G.729)
Proven ability to troubleshoot IP telephony and real-time communications using tools such as Wireshark and network analyzers
Solid understanding of network fundamentals (TCP/IP, VLANs, routing, switching, QoS) and voice security best practices (TLS, SRTP, firewalls)
Experience integrating voice, contact center (ACD/IVR), and UC platforms within cloud-native and hybrid environments
Proficiency in automation and scripting for voice and system management (Python, Bash, PowerShell)
Experience with observability and monitoring tools (Prometheus, Grafana, Zabbix, Elastic Stack)
Hands-on exposure to network and VoIP analysis tools such as Netscout NG1 and Wireshark
Familiarity with automation and CI/CD tools (Ansible, N8N, Jenkins, GitLab CI/CD)
Exposure to multi-cloud environments (AWS, Azure)
Certifications (Preferred): CCNA (Collaboration) or CompTIA Network+; cloud certifications (GCP, AWS, or Azure)

Tech Stack

Ansible
AWS
Azure
Cloud
Firewalls
Google Cloud Platform
Grafana
Jenkins
Oracle
Prometheus
Python
Switching
TCP/IP
VoIP

Benefits

Work From Home Set-up
Night Shift (8PM to 5AM), rotating weekend shifts

Site Reliability Engineer, GCP
