RadixArk is an infrastructure-first company focused on democratizing frontier-level AI infrastructure. It is seeking a Member of Technical Staff (Cluster / Platform) to architect and scale its core compute platform for AI training and inference. The role emphasizes deep systems engineering and high-performance GPU/TPU clusters.
Responsibilities:
- Architect and scale large AI compute clusters for training and inference
- Design cluster management, scheduling, and resource allocation systems
- Optimize performance, utilization, and reliability of GPU/TPU clusters
- Improve fault tolerance and system resilience at scale
- Drive observability, monitoring, and performance profiling for cluster infrastructure
- Collaborate with ML and systems engineers to support frontier AI workloads
- Lead capacity planning and infrastructure scaling strategies
- Build internal platforms and tooling to improve developer productivity
- Document architecture, operational practices, and reliability strategies
- Contribute to long-term platform vision and technical direction