Harnham is a mission-driven AI research organization operating at the cutting edge of large-scale data infrastructure and machine learning. As a Senior Infrastructure Engineer, you will design and optimize the core data pipelines and backend systems that power advanced AI research at scale.
Responsibilities:
- Design and optimize high-performance data pipelines for distributed training and large-scale storage using tools such as Arrow, DuckDB, LanceDB, BigQuery, and vector databases
- Drive low-level performance optimization across the stack: latency, throughput, GPU utilization, and reliability
- Build and maintain monitoring and observability tooling for data quality, pipeline performance, and experiment tracking
- Optimize distributed AI workloads for efficiency and scale across cloud infrastructure (primarily GCP)
- Architect public-facing data infrastructure capable of serving large, heterogeneous, multimodal datasets to a global research community
- Scope and supervise projects so that interns, PhD students, and post-docs can contribute effectively
- Set engineering standards and best practices across the infrastructure function
- Support technical hiring and help shape the growth of the engineering team
- Act as a bridge between research and engineering, translating prototype workflows into production-grade systems
Requirements:
- 5+ years of backend or infrastructure engineering experience
- Strong Python programming skills (Go is a strong plus; C++, Rust, or CUDA is a bonus)
- Proven experience building and supporting ML/AI infrastructure in production environments
- Hands-on experience with containerization and infrastructure-as-code (IaC): Docker, Kubernetes, Terraform
- Experience with cloud platforms (GCP preferred, AWS or Azure also considered)
- Proficiency with high-performance data tools such as DuckDB, Apache Spark, or Delta Lake
- Experience with distributed systems and large-scale data storage
- A backend-first, performance-obsessed mindset
- Experience mentoring junior engineers or researchers and breaking down complex technical problems
- GPU orchestration and large-scale model training experience
- HPC infrastructure experience (Slurm, Kubernetes clusters)
- Familiarity with ML platforms (Vertex AI, SageMaker) and frameworks (PyTorch, JAX)
- Monitoring stack experience (Prometheus, Grafana)
- Background in multimodal, audio, or large-scale scientific data
- Full-stack exposure (React or similar) sufficient to guide others