Establish performance baselines and capacity thresholds, correlate events, and define monitoring and alerting criteria
Develop automated solutions to address potential problems before they result in a service interruption
Provide impact assessment and mitigation plan for changes going into the production environment
Investigate the root cause of severe and systemic outages, identify corrective actions, and apply them across the enterprise
Develop availability measures that align with consumer experience to accurately assess the usability of crucial services
Build capacity models that baseline transactional load against resource performance, use the data to predict overall system capacity, and automate load placement to avoid outages
Identify thresholds for all critical links in the data path to quickly isolate where imbalances may result in outages
Analyze failure points in services to model risk level and resolution steps if failure occurs
Assist in driving architecture enhancements into the system to mitigate potential failure points
Programmatically monitor for and remediate configuration drift of critical devices
Develop response plans for potential failure points and evaluate their effectiveness during planned tests
Perform comprehensive operational health checks of all services to identify areas of concern, and track activities to drive improvements at all levels of the architecture
Provide technical coaching and direction to more junior teammates

Bachelor’s degree from an accredited university or college with a minimum of 4 years of professional experience OR Associate’s degree with a minimum of 7 years of professional experience OR High School Diploma with a minimum of 9 years of professional experience
Legal authorization to work in the U.S. is required
Excellent knowledge of AWS/Azure cloud services
Strong oral and written communication skills
Demonstrated experience scripting or developing software and services for the cloud (Python, Go, Java, Node.js, .NET, etc.)
Extensive knowledge of network protocols (TCP/IP, SNMP, FTP, syslog, TFTP, etc.)
Experience managing version control systems such as Git
Experience deploying and managing infrastructure on public clouds such as AWS or Azure
Experience using an automated configuration management system (Terraform, Chef, Puppet, Ansible, Salt, etc.)
Strong organizational and project management skills
Strong analytical and problem resolution skills
Excellent knowledge of Network Management (SNMP, MIB)
Experience with configuring, customizing, and extending monitoring tools (Datadog, Sensu, Grafana, Splunk, etc.)
Excellent knowledge of TCP/IP networking and inter-networking technologies (routing/switching, proxy, firewall, load balancing, etc.)
Knowledge of and experience using analytics software packages (such as MATLAB, SAS, JMP Pro) is a plus

Staff Site Reliability Engineer

Key skills