Ensure high availability and maximum uptime in SaaS environments by proactively monitoring and managing infrastructure, applications, and network systems using tools like Zabbix, Grafana, Prometheus, Sumo Logic, and Amazon CloudWatch.
Identify anomalies, threshold breaches, and performance bottlenecks early to prevent incidents.
Respond to alerts and incidents in real time, perform initial triage and troubleshooting across Windows/Linux servers, applications, and network components, and escalate complex issues with detailed analysis.
Validate application availability through endpoint checks, API monitoring, and service health verification.
Execute routine operations including system health checks, patching, maintenance, backups, and disaster recovery using cloud-native tools (e.g., AWS Backup, S3, RDS), while also monitoring resource utilization and supporting cost optimization initiatives.
Contribute to the creation, review, and continuous improvement of SOPs, runbooks, and knowledge base articles.
Participate in Change Management and Change Control processes by implementing approved changes, validating deployments, and minimizing risk to production systems.
Maintain accurate records in ticketing systems, ensuring proper documentation aligned with ITIL processes and audit requirements.
Participate in shift handovers, ensuring clear communication of ongoing incidents, system status, and pending actions.
Ensure strict adherence to SLAs, operational processes, security guidelines, and compliance standards.
Requirements
Associate’s or Bachelor’s degree in Information Technology, Networking, or a related field (or equivalent practical experience)
2–5 years of experience in a NOC, Cloud Operations, or Network Support environment
Solid understanding of networking fundamentals, including TCP/IP, DNS, DHCP, and VPN concepts
Strong knowledge of operating systems, particularly Linux and Windows environments
Hands-on experience or familiarity with monitoring and observability tools such as Zabbix, Grafana, Prometheus, Sumo Logic, Datadog, or the ELK stack
Good understanding of ITIL practices (Incident, Change, and Problem Management) in an operations environment
Experience troubleshooting SaaS application performance, system reliability, and cloud-based service disruptions
Basic scripting knowledge (Shell or Python) is an advantage for automation and operational efficiency
Willingness to work in a 24/7 shift-based support model
Effective communication and documentation skills for incident reporting, escalations, and knowledge sharing
Tech Stack
AWS
Cloud
DNS
Grafana
Linux
Prometheus
Python
TCP/IP
Benefits
Company stocks
Annual merit increase based on performance
15% night shift differential pay
Paid Leave with Cash Conversion
HMO with free dependents
Retirement Plan
Life Insurance
Internet and meal allowances while on a work-from-home setup
Employee Assistance Program for mental and social well-being
Government-mandated benefits (SSS, PhilHealth, Pag-IBIG, 13th-month pay, solo parent leave, special leave for women)