Cohere is a company focused on scaling intelligence to serve humanity through AI systems. They are seeking a Site Reliability Engineer to join their Model Serving team, responsible for developing and operating AI platforms that deliver large language models through API endpoints, ensuring high performance and reliability.
Responsibilities:
- Build self-service systems that automate managing, deploying, and operating services, including our custom Kubernetes operators that support language model deployments
- Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems
- Take the steps required to meet our defined SLOs, including participating in an on-call rotation
- Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback
- Develop our team through knowledge sharing and an active review process
Requirements:
- 5+ years of engineering experience running production infrastructure at a large scale
- Experience designing large, highly available distributed systems on Kubernetes, including GPU workloads on those clusters
- Experience developing for and supporting Kubernetes in both development and production environments
- Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
- Experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments
- Experience managing compute, storage, and network resources and their costs
- Excellent collaboration and troubleshooting skills to build mission-critical systems and ensure smooth operations and efficient teamwork
- The grit and adaptability to solve complex technical challenges that evolve day to day
- Familiarity with the computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence the latency and throughput of inference
- Strong understanding of, or working experience with, distributed systems
- Experience with Golang, C++, or other languages designed for high-performance, scalable servers