Forcepoint simplifies security for global businesses and governments. They are seeking a Site Reliability Engineer to standardize key principles across their products and maintain the reliability of their services for customers.
Responsibilities:
- Monitor, measure and improve the reliability, availability and scalability of Forcepoint products and infrastructure
- Engage in incident response and participate in post-mortem analysis to investigate root causes and capture contributing factors for remediation
- Perform analytics on previous incidents and trend/usage patterns to better predict issues and take proactive actions
- Design and build custom tools as needed to support process optimization, challenging the status quo and improving operational efficiency
- Participate in 24/7 rotational shifts and on-call duty to handle production operations issues
- Identify manual routine operational practices and build robust automation capabilities using code and modern tools
- Review and create dashboards/reports for application telemetry and infrastructure health to proactively identify performance constraints and bottlenecks
- Monitor product performance and availability, and provide feedback to develop, test, and implement robust monitoring, alerting, and logging solutions
- Work collaboratively with software developers to promote best practices in reliability and operability, including code reviews and architectural discussions
- Work with stakeholders to monitor our products, ensuring that the products meet architecture and observability design requirements
Requirements:
- Strong understanding of cloud-based architecture and operations. Hands-on experience with Amazon Web Services is preferred
- Experience in administration/build/management of Linux systems
- Foundational understanding of Infrastructure and Platform Technology stacks
- Strong understanding of networking concepts and theories, such as different protocols (TCP/IP, UDP, routing protocols, etc.), VLAN configuration, DNS, OSI layers, and load balancing
- Understanding of security architecture and certificate management
- Working knowledge of infrastructure and application monitoring platforms such as Grafana Cloud, Xymon, LibreNMS, etc.
- Working knowledge of incident response and alerting platforms such as PagerDuty, Opsgenie, xMatters, etc.
- Understanding of the core DevOps practices (CI/CD pipeline, release management, etc.)
- Ability to write code using any one modern programming language (Python, JavaScript, Ruby, etc.). Additional scripting skills are preferred
- Configuration management platform understanding and experience (Chef/Puppet/Ansible)
- Prior experience with cloud management automation tools (Terraform, CloudFormation, etc.)
- Experience with source code management software and API automation is crucial
- Cloud certifications or equivalent experience are highly regarded
- Service-availability-oriented mindset with a proactive approach to problem solving. The ideal candidate should be able to develop automated solutions to prevent recurring problems
- Ability and willingness to challenge the status quo and optimize current procedures and processes
- Strong sense of ownership and an ability to drive cross-functional process improvement
- Excellent interpersonal, written, and verbal communication skills
- Analytical and logical approach to problem solving and a willingness to automate repetitive tasks and reduce manual/reactive workload
- Ability and willingness to coach and mentor team members and colleagues