Cornelis Networks delivers the world’s highest-performance scale-out networking solutions for AI and HPC datacenters. The company is seeking a highly experienced Principal Software Engineer to lead the design, development, and upstream enablement of its AI and HPC communication middleware stack.
Responsibilities:
- Lead the design, enablement, and optimization of HPC middleware (MPI and SHMEM) and AI CCL stacks (e.g., NCCL/RCCL and related collective communication libraries)
- Deliver performance-critical communication paths, including low-latency small- and medium-message transfers, bulk SDMA data movement, GPU-Direct and IPC communication, and collective acceleration
- Design and tune collective communication algorithms (latency-optimized and bandwidth-optimized), including GPU-aware collectives
- Integrate middleware with underlying transports and provider layers such as libfabric/OFI, UCX, and verbs-style interfaces to achieve performance, portability, and maintainability
- Implement and optimize memory registration strategies, progress and execution models, completion semantics, multi-rail communication behavior, and GPU memory handling
- Drive upstream contributions across MPI/SHMEM projects, CCL ecosystems, and related components with a focus on upstreamable design and long-term maintainability
- Represent Cornelis Networks in open-source communities through technical reviews, design discussions, and sustained technical leadership
- Implement and prototype Ultra Ethernet capabilities supporting MPI/SHMEM and AI collective communication use cases
- Collaborate with ecosystem partners to validate deployment models and performance scaling on customer-relevant configurations
- Work closely with kernel, driver, and switch teams to deliver end-to-end performance aligned with the Cornelis product roadmap
- Participate in architecture reviews, performance tuning, scaling validation, and multi-layer root-cause investigations
- Analyze performance traces and triage advanced customer issues, translating findings into robust fixes and upstream improvements
- Publish internal and external best practices, including tuning guidance, reference configurations, and debugging methodologies
- Mentor senior engineers and promote best practices for design, testing, documentation, and code quality
- Help define the long-term middleware technical roadmap aligned with product evolution and customer needs
Requirements:
- 12+ years of experience in high-performance systems programming in C/C++ on Linux
- Hands-on experience with MPI internals (Open MPI, MPICH, MVAPICH) and/or SHMEM implementations
- Experience implementing or optimizing collective communications for HPC and/or AI workloads, including NCCL/RCCL (CUDA/ROCm) or related CCL stacks
- Demonstrated ability to design low-latency/high-throughput communication paths and diagnose performance issues using profiling and tracing tools
- Working knowledge of transport and integration layers such as OFI/libfabric, UCX, and verbs-style networking concepts
- Strong understanding of RDMA and of communication performance tuning
- Proven track record of open-source contributions
- Demonstrated technical leadership in complex HPC or AI system software
- Experience developing or maintaining libfabric providers
- Familiarity with Ultra Ethernet (UEC/UET) specifications
- Experience with RoCEv2, congestion control, or Ethernet-based RDMA deployments
- Experience with cluster-scale benchmarking, profiling, and optimization
- Background with Omni-Path/OPX or other high-performance HPC fabrics