Oracle, a leading company in AI and cloud solutions, is seeking a Principal Site Reliability Engineer to join its Cloud Infrastructure team. The role involves engineering infrastructure solutions for large-scale, highly distributed systems, ensuring availability and performance for customers, and collaborating with service owners on operational aspects.

Responsibilities:

Work with the Site Reliability Engineering (SRE) team on shared full-stack ownership of a collection of services and/or technology areas
Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services
Take responsibility for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance
Serve as the authority for end-to-end performance and operability
Partner with development teams in defining and implementing improvements in service architecture
Articulate technical characteristics of services and technology areas and guide Development Teams to engineer and add premier capabilities to the Oracle Cloud service portfolio
Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack
Demonstrate clear understanding of automation and orchestration principles
Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs)
Use a deep understanding of service topology and dependencies to troubleshoot issues and define mitigations
Understand and explain the effect of product architecture decisions on distributed systems
Bring professional curiosity and a desire to develop a deep understanding of services and technologies

Requirements:

US Citizenship
BS degree in Computer Science or related technical field involving coding or equivalent practical experience
Proficient in writing services/task automation in Python, Bash, Ruby, Perl, JavaScript, or Java
Strong communication skills (writing, organization, learning exchange)
Familiarity with core protocols (DNS, DHCP, HTTP, TCP)
Deep knowledge of Linux internals and host-based networking
Expert Linux/Unix performance and stability troubleshooting skills
Familiarity with configuration management solutions such as Chef and Puppet
Experience with devising, managing, and extending monitoring solutions for large scale environments
Experience in database management (Oracle DB, MySQL, Postgres)
Experience in shared file systems (Gluster, ZFS, etc.)
Systematic problem-solving approach, strong communication skills, a sense of ownership and drive
Deep understanding of service metrics and alarms through the development of dashboards, service KPIs, and alarming systems
Experience working in an operational environment with mission-critical, tier-one services and associated pager duty
5+ years managing large-scale, highly distributed service infrastructures
2+ years managing host virtualization technologies (KVM, Containers, Docker, etc.)
Experience supporting security applications, products and services in a cloud environment
Familiarity with US Government (USG) system accreditation processes
Proficient in coding complex, distributed systems using Python, Ruby, Java, or C/C++
Deep knowledge of Networking (TCP, UDP, DNS, DHCP, IPSec)
Deep focus on building secure Internet-facing systems and services in hostile environments
3+ years of experience in production software development with Agile methodologies
3+ years managing host, network, or storage virtualization technologies
Expert troubleshooting skills
Expertise in fleet automation and management solutions
