Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. As a Site Reliability Engineer (SRE), you will ensure the reliability, scalability, and performance of systems, applications, and infrastructure while automating operational processes.

Responsibilities:

Build, maintain, and operate the AWS-hosted platform
Work closely with dev teams to identify and measure SLOs, SLAs, and SLIs
Contribute to the development of platform services, including architecture, provisioning, configuration, deployment, and support
Integrate with centralized logging, metrics dashboards, instrumentation, and incident monitoring and management
Participate in on-call rotation for incident resolution for the platform and/or any dependent components
React to production deficiencies by continuously implementing automation, self-healing, and real-time monitoring for production systems
Maintain operational tooling and frameworks
Perform root cause analysis and deliver resolutions for tool and automation failures
Build/integrate/administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
Automate alerts for metrics on performance, cost, vulnerabilities, risk, and compliance violations
Conduct postmortems after production issues

Requirements:

5+ years of experience in software engineering
3+ years of scripting experience in Python or PowerShell
3+ years of experience with Linux system administration and shell scripting
3+ years of experience with networking fundamentals, including VPN setup, routing, security groups, and cross-cloud connectivity
2+ years of experience with AWS services: EC2, VPC, IAM, Lambda, S3, CloudWatch
2+ years of experience with Infrastructure-as-Code: Terraform, AWS CloudFormation, CDK
If you are offered this position, you will be required to provide extensive personal information to obtain and maintain a suitability or determination of eligibility for a Confidential/Secret or Top Secret security clearance as a condition of your employment
United States Citizenship
Bachelor's degree in Information Technology, Computer Science, or a related field
1+ years of experience with CI/CD pipeline basics using Git and GitLab
1+ years of experience monitoring and alerting with CloudWatch and Dynatrace
1+ years of experience with containerized workloads (ECS, EKS, etc.)
Experience with security and compliance frameworks: FedRAMP Moderate, NIST 800-171
Experience using AI-driven anomaly detection in CloudWatch for proactive issue resolution
Experience automating patching and scaling with predictive models, and supporting infrastructure for AI-based applications
All employees working remotely will be required to adhere to UnitedHealth Group's Telecommuter Policy

Sr. Site Reliability Engineer - Remote
