Hark is an artificial intelligence company building advanced, personalized intelligence. They are seeking a Member of Technical Staff, Infrastructure Compute to lead and manage large-scale GPU computing clusters for AI training and deployment workloads.
Responsibilities:
- Design, implement, and maintain Infrastructure as Code (IaC) best practices to enable repeatable, auditable, and scalable cluster provisioning
- Enhance and harden CI/CD deployment pipelines to ensure robust, secure, and low-latency model service delivery across production environments
- Own and evolve stable training infrastructure operating at the scale of 10,000+ GPUs, including job scheduling, fault tolerance, and network fabric optimization
- Partner closely with ML researchers and engineers to understand compute bottlenecks and translate them into infrastructure improvements
- Monitor system health, define SLOs, and lead incident response for critical training and inference workloads
- Drive capacity planning, cost efficiency initiatives, and hardware lifecycle management across the GPU fleet
- Contribute to internal tooling and platform abstractions that improve developer experience for teams consuming compute resources