DigitalOcean is a cutting-edge technology company focused on simplifying cloud and AI for builders. They are seeking a Senior Engineer to join their AI Infrastructure Control Plane team. The role's main responsibility is to architect and develop distributed services for AI workloads while maintaining operational excellence and supporting strategic growth.
Responsibilities:
- Architectural Leadership: Drive the design and development of distributed services within our AI Infrastructure ecosystem, including complex orchestration for LLM inference and hosting services, with a focus on control plane orchestration of compute, networking, and storage for AI workloads
- System Design: Create, refine, and assess system design proposals for our high-scale, multi-tenant inference cloud ecosystem, ensuring they meet rigorous standards for availability and resiliency
- Operational Excellence: Lead the operational strategy for critical services, defining SLOs and leveraging advanced observability to maintain platform health in a high-scale environment
- Cross-Functional Collaboration: Partner deeply with Product Management, TPMs, and Engineering Management peers to align technical roadmaps with business priorities
- Strategic Growth: Work on new architecture initiatives that enable fleet optimization and help evolve DigitalOcean into a market leader for AI-native networking and infrastructure
Requirements:
- Deep experience with distributed and cloud services, including messaging systems, databases, infrastructure as code, observability, and security
- Advanced knowledge of cloud networking (VPCs, Load Balancers), containerization (Kubernetes), and cloud storage (block, object, NFS)
- Proven experience building AI/ML products, specifically focusing on Gen AI platforms, LLM hosting, and inference workflows
- Significant experience running customer-facing, high-availability services across multiple regions
- Experience integrating and building with open-source software and a bias for technical ownership
- Expert proficiency in Go or Python, and familiarity with gRPC for service-to-service communication
- Deep networking software experience (VLAN, RDMA, Ethernet, L3-L7 network protocols, etc.)