Centene Corporation is a diversified national organization focused on improving health outcomes through technology. The company is seeking a Senior Site Reliability Engineer to lead projects that enhance platform infrastructure performance and reliability, using SRE practices and observability tools.
Responsibilities:
- Assists application development teams in creating a Disaster Recovery playbook
- Troubleshoots and resolves more complex problems with systems and services and initiates regular deployment of new versions of the systems and their subcomponents
- Leads more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
- Helps make decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
- Uses knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
- Identifies and implements necessary manual and automated procedures for improved collaborative response in real time
- Leads lower-level engineers in stress, security, and performance testing
- Resolves issues that come up through support escalation
- Keeps documentation and runbooks up to date to effectively deal with new incidents that might arise
- Leads post-incident reviews and documents findings to inform future decision-making
- Reviews proposals to optimize the Software Development Life Cycle (SDLC) to boost service reliability and decides which proposals should move forward
- Communicates complex topics with development teams to investigate and document issues, and leads the internal team in developing solutions to mitigate them
- Performs other duties as assigned
- Complies with all policies and standards
Requirements:
- A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science)
- 4 – 6 years of related experience or equivalent experience acquired through accomplishments of applicable knowledge, duties, scope and skill reflective of the level of this position
- Disaster Recovery
- AWS
- SQL
- MongoDB