Oracle, a leading company in AI and cloud solutions, is seeking a Principal Software Engineer for its AI Infrastructure Innovation team. The role involves leading the architecture and hands-on development of next-generation storage technologies, with a focus on distributed systems and cloud platforms.
Responsibilities:
- Lead end-to-end architecture, system design, and implementation for distributed storage platforms
- Innovate on query processing, transaction, and IO performance, working across components: query planning/optimization, the distributed execution engine, and the index and storage engines
- Develop production-grade, high-performance software features with rigorous durability, correctness, observability, and security
- Define performance goals and success metrics; design benchmarks and conduct large-scale experiments to validate throughput and latency
- Define consistency, replication, and recovery strategies
- Collaborate across storage, networking, compute, and control-plane teams to deliver cohesive end-to-end solutions on OCI
- Mentor engineers, provide technical leadership and reviews, and influence the multi-year roadmap and technical standards
Requirements:
- Deep expertise in distributed systems with hands-on delivery of large-scale, fault-tolerant, strongly consistent services
- Experience building distributed execution engines
- Proven ability to design for global scale: sharding/partitioning, placement policies, rebalancing, and multi-region replication
- Strong software engineering background with performance profiling, correctness testing, and rigorous code quality
- Cloud architecture experience on a major public cloud, including observability, orchestration, and incident response
- BS/MS in Computer Science, Electrical/Computer Engineering, or equivalent practical experience; proven technical leadership and mentoring
- Familiarity with high-performance IO paths; understanding of cross-region networking and latency trade-offs
- Strong foundation in consensus and transactions
- Expertise with observability at scale: tracing, metrics, logs, eBPF/perf, chaos/failure testing, and SLO-driven operations
- Knowledge of AI/HPC workload patterns and their implications for storage, query processing, and consistency models