Working on Internet technologies to improve the performance, availability, and scalability of large distributed content delivery systems
Collaborating with cross-functional teams to define and establish measurable Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Monitoring platform availability and performance, debugging issues through data analysis, and implementing corrective actions to prevent recurrence
Developing and implementing automation solutions to improve operational efficiency and reduce toil
Improving CI/CD pipelines and safe deployment practices for platform services
Participating in design reviews and providing technical guidance to ensure designs meet requirements for scalability, performance, and robustness

Have 2 years of relevant experience and a Bachelor's degree in Computer Science or its equivalent
Have hands-on experience with containerization and orchestration platforms such as Docker and Kubernetes
Have experience with monitoring and alerting systems (e.g., Prometheus, Grafana, ADBMS, Datadog), including metric collection, alerting, dashboarding, and troubleshooting
Show fluency working in a UNIX/Linux computing environment
Have familiarity with infrastructure-as-code tools such as Terraform
Have proficiency with a configuration management tool such as Ansible, SaltStack, Chef, Puppet, or similar

Site Reliability Engineer – II

Key skills