Oracle is a leading company in AI and cloud solutions, dedicated to innovation that impacts billions of lives. The company is seeking a Senior AI Site Reliability Engineer to design, build, and operate a next-generation AI-first Electronic Health Record platform, focusing on reliability, scalability, and automation in cloud operations.
Responsibilities:
- Work with the Site Reliability Engineering (SRE) team to take shared ownership of services and platform components. Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior
- Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
- Improve system reliability through automation, monitoring, and performance optimization
- Contribute to the adoption of AI-assisted approaches for operations, including enhancing observability and alerting, supporting automated incident detection and remediation, and exploring intelligent automation for infrastructure lifecycle management
- Partner with development teams to enhance service architecture, scalability, and operability
- Participate in on-call rotations and act as an escalation point for complex production issues
- Perform root cause analysis and implement long-term fixes to prevent recurrence
- Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
- Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale
- Implement and optimize infrastructure for Oracle HDI Analytics Platform
- Ensure system uptime, reliability, and scalability
- Design and implement GenAI-powered or agent-based solutions for observability and anomaly detection, incident triage and remediation, and infrastructure provisioning and lifecycle management
- Build tools and frameworks that enable self-service and autonomous operations
- Build and optimize scalable data pipelines using Vertica and ETL frameworks
- Apply DevOps/SRE practices to automate deployments and operations
- Enhance observability using Prometheus/Grafana and AI-driven insights
- Support multi-cloud initiatives across OCI, AWS, and Azure
- Optimize cost, performance, and compliance across environments
- Implement preventative and automated remediation solutions
- Work closely with engineers to execute technical roadmaps
- Contribute to code reviews and infrastructure improvements
Requirements:
- 3 to 5+ years of experience in software engineering, cloud infrastructure, SRE, or DevOps
- U.S. citizenship is required for this position
- Experience building and operating high-availability, fault-tolerant systems
- Strong understanding of distributed systems, performance monitoring, and resiliency patterns
- Experience with incident response, root-cause analysis, and production troubleshooting
- Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
- Ability to design or integrate AI-driven workflows for operational efficiency and reliability
- Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
- Strong experience with multi-cloud environments (OCI, AWS/Azure)
- Deep understanding of cloud infrastructure design, deployment, and resource optimization
- Experience managing hybrid or cross-cloud architectures
- Advanced competency in CI/CD pipelines (Jenkins, Kubernetes), Infrastructure as Code (Terraform), and observability tools (Prometheus, Grafana)
- Strong focus on automation-first operations
- Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
- Experience with ETL frameworks and large-scale data processing
- Understanding of columnar storage systems
- Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
- Strong proficiency in Python, Java, or Go
- Experience with Docker, Kubernetes, and shell scripting
- Strong troubleshooting skills with ability to perform root-cause analysis
- Experience resolving complex production issues in distributed systems
- Experience in healthcare or regulated environments (HIPAA, compliance frameworks)
- Experience working in environments requiring security clearance
- Experience building self-healing or autonomous infrastructure systems