Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. As a Site Reliability Engineer (SRE), you will ensure the reliability, scalability, and performance of systems, applications, and infrastructure while automating operational processes.
Responsibilities:
- Build, maintain, and operate the AWS hosted platform
- Work closely with dev teams to identify and measure SLOs, SLAs, and SLIs
- Contribute to the development of platform services, including architecture, provisioning, configuration, deployment, and support
- Integrate the platform with centralized logging, metrics dashboards, instrumentation, and incident monitoring and management
- Participate in the on-call rotation to resolve incidents affecting the platform and/or any dependent components
- Respond to production deficiencies by continuously implementing automation, self-healing, and real-time monitoring for production systems
- Maintain operational tooling and frameworks
- Perform root cause analysis and resolve tooling and automation failures
- Build, integrate, and administer systems and tools that enable engineering teams to observe their applications in production autonomously (dashboards, APM tools)
- Automate alerts for metrics on performance, cost, vulnerabilities, risk, and compliance violations
- Conduct postmortems after production issues
Requirements:
- 5+ years of experience in software engineering
- 3+ years of scripting experience in Python or PowerShell
- 3+ years of experience with Linux system administration and shell scripting
- 3+ years of experience with networking fundamentals, including VPN setup, routing, security groups, and cross-cloud connectivity
- 2+ years of experience with AWS services: EC2, VPC, IAM, Lambda, S3, CloudWatch
- 2+ years of experience with Infrastructure-as-Code: Terraform, AWS CloudFormation, CDK
- If you are offered this position, you will be required to provide extensive personal information to obtain and maintain a suitability or determination of eligibility for a Confidential/Secret or Top Secret security clearance as a condition of your employment
- United States Citizenship
- Bachelor's degree in Information Technology, Computer Science or related field
- 1+ years of experience with CI/CD pipeline basics using Git and GitLab
- 1+ years of experience monitoring and alerting with CloudWatch and Dynatrace
- 1+ years of experience with containerized workloads (ECS, EKS, etc.)
- Experience with security and compliance frameworks: FedRAMP Moderate, NIST 800-171
- Experience using AI-driven anomaly detection in CloudWatch for proactive issue resolution
- Experience automating patching and scaling with predictive models, and supporting infrastructure for AI-based applications
- All employees working remotely will be required to adhere to UnitedHealth Group's Telecommuter Policy