Clockwork Systems, Inc. is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability for GPU clusters. In this role, you will design and build scalable backend systems for AI and GPU cluster observability, with a focus on high-performance distributed systems for telemetry ingestion and data processing.
Responsibilities:
- Design and build scalable backend systems for metric collection, processing, and analysis
- Develop robust methods to detect complex infrastructure issues that impact AI workloads
- Build large-scale distributed systems that run in production environments
- Collaborate across teams to deliver reliable, performant, and maintainable systems