Stage 4 Solutions is a global B2B high-tech company seeking a Site Reliability Engineer. The role covers data monitoring, data quality assurance, and improving system performance and reliability through hands-on technical operations.
Responsibilities:
- Data monitoring and alerting, data quality assurance, and anomaly detection
- Document team processes and policies, including engagement models and service level objectives (SLOs)
- Analyze, design, and implement system-level solutions that remove bottlenecks and improve edge service performance
- Implement monitoring and alerting to improve issue detection and response
- Work in a fast-paced environment, taking part in technical operations and rotations that respond to performance and reliability issues
- Participate in on-call rotations, resolving or escalating incoming events
- Maintain and operate a Linux and Kubernetes environment
Requirements:
- 3+ years of experience with Unix and Linux systems, from the kernel to the shell and beyond, including system libraries, file systems, and client-server protocols
- Experience reading Python scripts used for platform operations
- Experience with networking technologies such as TCP/IP, BGP, and DNS in a carrier-grade environment
- Experience developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, IPVS, the ELK stack, Hadoop, or similar
- Bachelor's degree