Oracle, a leader in AI and cloud solutions, is seeking a Principal AI Infrastructure Reliability Engineer to join its Health Data Intelligence team. The role focuses on designing, building, and operating reliable infrastructure for large-scale healthcare analytics, while advancing automation and AI-assisted reliability practices.
Responsibilities:
- Work with the Site Reliability Engineering (SRE) team to take shared ownership of services and platform components
- Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior
- Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
- Improve system reliability through automation, monitoring, and performance optimization
- Contribute to the adoption of AI-assisted approaches for operations, including:
  - Enhancing observability and alerting
  - Supporting automated incident detection and remediation
  - Exploring intelligent automation for infrastructure lifecycle management
- Partner with development teams to enhance service architecture, scalability, and operability
- Participate in on-call rotations and act as an escalation point for complex production issues
- Perform root cause analysis and implement long-term fixes to prevent recurrence
- Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
- Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale
- Implement and optimize infrastructure for Oracle HDI Analytics Platform
- Ensure system uptime, reliability, and scalability
- Design and implement GenAI-powered or agent-based solutions for:
  - Observability and anomaly detection
  - Incident triage and remediation
  - Infrastructure provisioning and lifecycle management
- Build tools and frameworks that enable self-service and autonomous operations
- Build and optimize scalable data pipelines using Vertica and ETL frameworks
- Apply DevOps/SRE practices to automate deployments and operations
- Enhance observability using Prometheus/Grafana and AI-driven insights
- Support multi-cloud initiatives across OCI, AWS, and Azure
- Optimize cost, performance, and compliance across environments
- Implement preventative and automated remediation solutions
- Work closely with engineers to execute technical roadmaps
- Contribute to code reviews and infrastructure improvements
Requirements:
- U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire
- Experience building and operating high-availability, fault-tolerant systems
- Strong understanding of distributed systems, performance monitoring, and resiliency patterns
- Experience with incident response, root-cause analysis, and production troubleshooting
- Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
- Ability to design or integrate AI-driven workflows for operational efficiency and reliability
- Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
- Strong experience with multi-cloud environments (OCI, AWS/Azure)
- Deep understanding of cloud infrastructure design, deployment, and resource optimization
- Experience managing hybrid or cross-cloud architectures
- Advanced competency in:
  - CI/CD pipelines (Jenkins, Kubernetes)
  - Infrastructure as Code (Terraform)
  - Observability tools (Prometheus, Grafana)
- Strong focus on automation-first operations
- Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
- Experience with ETL frameworks and large-scale data processing
- Understanding of columnar storage systems
- Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
- Strong proficiency in Python, Java, or Go
- Experience with Docker, Kubernetes, and shell scripting
- Strong troubleshooting skills with ability to perform root-cause analysis
- Experience resolving complex production issues in distributed systems
- 10+ years of software engineering experience, with 8+ years in cloud infrastructure, SRE, or DevOps
- Proven ownership of production system reliability in cloud environments
- Cloud infrastructure design and automation
- Distributed systems and performance optimization
- Data warehousing and ETL frameworks
- Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
- Experience building or integrating AI-powered automation for DevOps/SRE workflows
- Familiarity with tools like LangChain, AutoGPT, or custom AI agents
- Terraform, Docker, Kubernetes
- Observability stacks (Prometheus, Grafana)
- Python, Java, or Go
- Strong problem-solving mindset with a focus on automation and scalability
- Experience improving system reliability through intelligent automation
- Experience in healthcare or regulated environments (HIPAA, compliance frameworks)
- Familiarity with Oracle HDI or large-scale analytics platforms
- Experience working in environments requiring security clearance
- Experience building self-healing or autonomous infrastructure systems