Clockwork Systems, Inc. is pioneering a software-driven approach to AI fabrics to improve GPU cluster utilization. The role involves designing and implementing low-level systems software for GPU clusters to make large-scale GPU training more reliable and efficient.
Responsibilities:
- Design and implement low-level systems software for GPU clusters
- Work with the internals of frameworks such as PyTorch, NCCL, and the CUDA runtime, not as a user but by modifying and extending them
- Build components that make large-scale GPU training more reliable and efficient
- Debug complex distributed/concurrent systems where failures are subtle and non-deterministic
- Own systems end-to-end: from design through production