CVS Health is dedicated to building a world of health around every individual and is seeking an Infrastructure Site Reliability Engineer. This role involves designing, implementing, and managing infrastructure systems to ensure the reliability and performance of the technology platforms that support various business initiatives.

Responsibilities:

Manage and maintain various systems and infrastructure, such as servers, storage, mainframe, iSeries, backup, archive, and recovery, ensuring the platforms have high availability, scalability, and reliability to meet the business requirements
Participate in on-call rotation to ensure availability and uptime of critical systems and provide timely response and resolution to incidents
Develop and maintain best practices documentation, including system architecture diagrams, standard operating procedures, and runbooks
Perform system and application performance analysis, utilizing monitoring tools, logging systems, and other relevant metrics, to identify and resolve issues and enhance overall system performance
Streamline and optimize operational processes, procedures, and documentation by implementing industry best practices
Develop, modify, and implement incident and problem management processes to increase efficiency and reduce downtime
Establish a comprehensive SRE process that encompasses the entire software team, ensuring seamless operations and prompt resolution of any escalated issues
Collaborate with development teams to participate in code reviews, performance optimization, and application deployment processes
Drive reliability engineering practices, including monitoring, alerting, incident management, capacity planning, and disaster recovery
Automate infrastructure deployments, upgrades, and maintenance tasks, utilizing configuration management tools like Ansible and infrastructure-as-code frameworks such as Terraform
Stay abreast of industry trends, emerging technologies, and best practices in infrastructure site reliability engineering and apply knowledge to continually improve CVS Health's systems and processes
Provide customer support teams with meticulously documented procedures so they can address customer complaints proficiently and deliver optimal service
Analyze historical usage patterns and growth projections to forecast future capacity requirements
Collaborate with stakeholders such as developers, product managers, and operations teams to understand the demand for resources and estimate the necessary infrastructure capacity
Establish and maintain monitoring systems to track the performance and utilization of critical resources
Identify potential bottlenecks, anomalies, and areas for improvement
Perform regular performance reviews to help ensure systems meet defined service-level objectives (SLOs) and key performance indicators (KPIs)

Requirements:

7+ years of experience in Infrastructure Engineering, System Administration, or related roles
3+ years of experience with cloud platforms (e.g., Amazon Web Services, Microsoft Azure) and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
2+ years of experience with at least one configuration management tool such as Ansible, Puppet, or Chef
2+ years of experience with containerization technologies such as Docker and container orchestration platforms like Kubernetes
2+ years of experience in networking principles and protocols, including TCP/IP, DNS, load balancing, and firewalls
1+ years of experience with incident management, performance monitoring, and capacity planning tools
Excellent troubleshooting and problem-solving skills, with the ability to identify, communicate, and resolve technical issues swiftly
