Harnham is a well-funded AI research company building next-generation multimodal models for media and interactive experiences. The Research Engineer role is deeply technical: you will optimize large-scale AI systems for efficiency and real-time performance, working across the stack from GPU kernels to distributed training systems.
Responsibilities:
- Optimize training throughput across large GPU clusters, improving efficiency and utilization
- Implement techniques such as mixed precision (FP8, BF16), memory-efficient attention, and activation checkpointing
- Design and scale distributed training systems (tensor parallelism, FSDP, multi-node setups)
- Profile and optimize inference pipelines for real-time multimodal generation
- Improve latency through CUDA graphing, KV cache optimization, and operator fusion
- Contribute across the stack, from kernel-level optimization to system-level architecture
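To make one of the inference-side responsibilities concrete, here is a minimal, framework-free sketch of KV cache optimization for autoregressive decoding. All names (KVCache, attend) are hypothetical and for illustration only; a real implementation would use batched GPU tensors, not Python lists.

```python
import math

def attend(q, ks, vs):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]

class KVCache:
    """Append-only cache: each decode step adds one key/value pair instead of
    recomputing keys/values for the whole prefix, so per-token attention cost
    drops from O(n^2) to O(n)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

With a single cached entry the attention weights collapse to 1.0, so the first `step` returns its value vector unchanged; subsequent steps blend all cached values.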
Requirements:
- 4+ years of experience in systems engineering, ML infrastructure, or performance optimization
- Strong experience with GPU programming (CUDA, Triton, or similar)
- Experience with distributed systems and large-scale training (NCCL, model parallelism)
- Familiarity with ML framework internals such as PyTorch or JAX
- Experience with mixed or low-precision techniques (FP8, INT8, BF16)
- Proven experience building and operating scalable, fault-tolerant training systems
- Strong interest in pushing the limits of performance for cutting-edge AI systems
- Experience with compiler optimizations or model compilation (e.g., torch.compile)
- Background working on large multimodal or generative models
- Exposure to real-time inference systems
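As a rough illustration of the low-precision requirement above, the sketch below shows symmetric per-tensor INT8 quantization with a single absmax scale. This is a toy, assumption-laden version of what libraries like TensorRT or torchao do with calibrated scales per channel; the function names are invented for this example.

```python
def quantize_int8(xs):
    """Map floats to int8 range [-127, 127] using one absmax scale factor."""
    scale = max(abs(x) for x in xs) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; error is bounded by half the scale step."""
    return [qi * scale for qi in q]
```

The round trip preserves the extremes exactly and bounds the error on every other element by half a quantization step, which is why absmax scaling is a common baseline before moving to per-channel or FP8 schemes.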