Oracle is seeking a Senior Principal Network Development Engineer to lead backend NIC qualification and New Product Introduction (NPI) for next-generation networking platforms. The role drives cross-organizational initiatives and ensures NIC technologies meet the stringent performance requirements of OCI’s AI superclusters.
Responsibilities:
- Own the end-to-end qualification strategy and execution for backend NICs supporting OCI AI clusters (RDMA/RoCE-based fabrics)
- Lead NIC NPI for AI infrastructure, from early silicon bring-up through fleet-wide deployment across OCI regions
- Define validation methodologies for high-performance, low-latency distributed training workloads (e.g., GPU collectives, east-west traffic patterns)
- Drive deep performance characterization and tuning of NICs in AI cluster environments (latency, throughput, tail latency, congestion behavior)
- Partner with NIC and silicon vendors (e.g., NVIDIA/Mellanox, Broadcom, Intel) to resolve complex hardware/firmware issues and influence feature design
- Collaborate with OCI AI/ML platform, cluster networking, and host software teams to ensure optimal integration with drivers, kernel, and user-space stacks
- Build and scale automated validation frameworks for continuous qualification across rapidly evolving AI hardware generations
- Lead root cause analysis (RCA) for systemic issues impacting cluster performance or reliability, driving fixes across all layers of the stack
- Establish qualification gates, acceptance criteria, and release readiness processes aligned with OCI production standards
- Use production telemetry and workload insights to inform validation strategy and proactively identify risk areas
- Influence NIC and system architecture to meet the demands of next-generation AI workloads
Requirements:
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field
- 8–12+ years of experience in networking, systems engineering, or hardware validation in large-scale distributed environments
- Deep expertise in NIC architecture and advanced features (RDMA/RoCE, congestion control, SR-IOV, queueing, offloads)
- Strong understanding of distributed systems networking for AI/ML workloads (e.g., collective communication patterns, east-west traffic scaling)
- Advanced knowledge of Linux networking stack and kernel-level debugging
- Proven experience leading hardware qualification and/or NPI efforts in data center or cloud environments
- Strong debugging skills across hardware, firmware, driver, and system layers
- Proficiency in automation and tooling (Python, Bash, or similar)
- Experience with performance benchmarking and traffic analysis in high-scale environments
- Experience with AI/HPC networking (e.g., RoCEv2, InfiniBand concepts, GPU cluster networking)
- Familiarity with distributed training frameworks (e.g., NCCL) and their network behavior
- Knowledge of PCIe, NUMA, and GPU/accelerator interconnect considerations
- Experience in hyperscale cloud environments
- Exposure to SmartNICs, DPUs, or offload-driven architectures
- Experience building validation pipelines integrated with CI/CD systems
- Background in large-scale cluster bring-up and production operations