Oracle is a leading company in AI and cloud solutions, and they are seeking a Principal Site Reliability Engineer to join their Cloud Infrastructure team. The role involves engineering infrastructure solutions for large-scale, highly distributed systems while ensuring availability and performance for customers, as well as collaborating with service owners on operational aspects.
Responsibilities:
- Work with the Site Reliability Engineering (SRE) team on shared full-stack ownership of a collection of services and/or technology areas
- Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services
- Design and deliver the mission-critical stack, with a focus on security, resiliency, scale, and performance
- Serve as the authority for end-to-end performance and operability
- Partner with development teams in defining and implementing improvements in service architecture
- Articulate technical characteristics of services and technology areas and guide Development Teams to engineer and add premier capabilities to the Oracle Cloud service portfolio
- Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack
- Demonstrate clear understanding of automation and orchestration principles
- Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs)
- Use a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations
- Understand and explain the effect of product architecture decisions on distributed systems
- Professional curiosity and a desire to develop a deep understanding of services and technologies
Requirements:
- US Citizenship
- BS degree in Computer Science or a related technical field involving coding, or equivalent practical experience
- Proficient in writing services and task automation in Python, Bash, Ruby, Perl, JavaScript, or Java
- Proficient communication skills (writing, organization, learning exchange)
- Familiarity with core protocols (DNS, DHCP, HTTP, TCP)
- Deep knowledge of Linux internals and host-based networking
- Expert Linux/Unix performance and stability troubleshooting skills
- Familiarity with configuration management solutions such as Chef or Puppet
- Experience devising, managing, and extending monitoring solutions for large-scale environments
- Experience in database management (Oracle DB, MySQL, Postgres)
- Experience with shared file systems (Gluster, ZFS, etc.)
- Systematic problem-solving approach, strong communication skills, a sense of ownership and drive
- Deep understanding of service metrics and alarms, gained through developing dashboards, service KPIs, and alarming systems
- Experience working in an operational environment with mission-critical tier-one services and associated pager duty
- 5+ years managing large-scale, highly distributed service infrastructures
- 2+ years managing host virtualization technologies (KVM, containers, Docker, etc.)
- Experience supporting security applications, products, and services in a cloud environment
- Familiarity with US Government (USG) system accreditation processes
- Proficient in coding complex, distributed systems using Python, Ruby, Java, or C/C++
- Deep knowledge of networking (TCP, UDP, DNS, DHCP, IPSec)
- Deep focus on building secure Internet-facing systems and services in hostile environments
- 3+ years of experience in production software development with Agile methodologies
- 3+ years managing host, network, or storage virtualization technologies
- Expert troubleshooting skills
- Expertise in fleet automation and management solutions