DigitalOcean is a cutting-edge technology company focused on simplifying cloud and AI for builders. They are seeking a Senior Engineer to lead the design and development of distributed services for their AI Infrastructure, ensuring high availability and operational excellence.
Responsibilities:
- Drive the design and development of distributed services within our AI Infrastructure ecosystem, including complex orchestration for LLM inference and hosting services
- Create, refine, and assess system design proposals for our high-scale, multi-tenant inference cloud ecosystem, ensuring they meet rigorous standards for availability and resiliency
- Lead the operational strategy for critical services, defining SLOs and leveraging advanced observability to maintain platform health in a high-scale environment
- Partner closely with Product Management, TPMs, and Engineering Management peers to align technical roadmaps with business priorities
- Work on new architecture initiatives that enable fleet optimization and help evolve DigitalOcean into a market leader for AI-native networking and infrastructure
Requirements:
- Deep experience with distributed and cloud services, including messaging systems, databases, infrastructure as code, observability, and security
- Advanced knowledge of cloud networking (VPCs, load balancers), containerization (Kubernetes), and cloud storage (block, object, NFS)
- Proven experience building AI/ML products, specifically focusing on Gen AI platforms, LLM hosting, and inference workflows
- Significant experience running customer-facing, high-availability services across multiple regions
- Experience integrating and building with open-source software and a bias for technical ownership
- Expert-level proficiency in Go or Python, and familiarity with gRPC for service-to-service communication
- Deep networking software experience (VLAN, RDMA, Ethernet, L3-L7 network protocols, etc.)