Hark is an artificial intelligence company building advanced, personalized intelligence. They are seeking a Member of Technical Staff, Infrastructure Compute to lead and manage large-scale GPU computing clusters for AI training and deployment workloads.

Responsibilities:

Design, implement, and maintain Infrastructure as Code (IaC) best practices to enable repeatable, auditable, and scalable cluster provisioning
Enhance and harden CI/CD deployment pipelines to ensure robust, secure, and low-latency model service delivery across production environments
Own and evolve stable training infrastructure operating at the scale of 10,000+ GPUs, including job scheduling, fault tolerance, and network fabric optimization
Partner closely with ML researchers and engineers to understand compute bottlenecks and translate them into infrastructure improvements
Monitor system health, define SLOs, and lead incident response for critical training and inference workloads
Drive capacity planning, cost efficiency initiatives, and hardware lifecycle management across the GPU fleet
Contribute to internal tooling and platform abstractions that improve developer experience for teams consuming compute resources

Infrastructure, Large-scale Training

Key skills

About this role

Responsibilities: