Reflection is a company focused on building open superintelligence and making it accessible to all. They are seeking a member of the Compute Platform team to manage and maintain their compute layer, focusing on cluster management, platform engineering, and monitoring for performance benchmarking.

Responsibilities:

Cluster Management: Build and maintain tools for the automatic remediation, topology-aware scheduling, capacity planning and rapid hardware debugging
Platform Engineering: Design and iterate on our cluster management stack for workloads across large, multi-GPU fleets
Monitoring & Observability: Implement comprehensive cluster-wide monitoring, focusing on durability and active performance benchmarking
Roadmap Execution: Prepare the infrastructure for next-generation GPU deployments and increasingly larger cluster sizes. Long-term, you will help own multi-cloud storage, petabyte-scale data replication, and GPU-to-GPU network performance

Member of Technical Staff - Compute Platform

Key skills

About this role

Responsibilities: