Reflection is a company focused on building open superintelligence and making it accessible to all. It is seeking a member for its Compute Platform team to manage and maintain the compute layer, with a focus on cluster management, platform engineering, and monitoring, including performance benchmarking.
Responsibilities:
- Cluster Management: Build and maintain tools for automatic remediation, topology-aware scheduling, capacity planning, and rapid hardware debugging
- Platform Engineering: Design and iterate on our cluster management stack for workloads across large, multi-GPU fleets
- Monitoring & Observability: Implement comprehensive cluster-wide monitoring, focusing on durability and active performance benchmarking
- Roadmap Execution: Prepare the infrastructure for next-generation GPU deployments and increasingly large clusters. Long-term, you will help own multi-cloud storage, petabyte-scale data replication, and GPU-to-GPU network performance