FirstPrinciples is a research organization building AI infrastructure for discovery in fundamental science. This role involves building and operating the compute foundation for AI-driven scientific discovery, focusing on Kubernetes clusters, Linux systems, and GPU infrastructure.

Responsibilities:

Design, deploy, and operate Kubernetes infrastructure for AI inference, research, and engineering workloads
Set up and manage GPU and HPC-style compute environments, including scheduling, utilization, job management, and node-level troubleshooting
Work with systems such as Kubernetes, Slurm or similar schedulers, container runtimes, GPU drivers and libraries (e.g., CUDA), storage systems, and observability tools
Build and manage Linux-based compute environments, including provisioning, networking, storage, monitoring, access control, and lifecycle management
Help architect bare-metal, cloud, and hybrid infrastructure across AWS, GCP, Azure, or equivalent platforms
Own the reliability and operational health of infrastructure systems, including monitoring, alerting, incident response, capacity planning, and performance tuning
Improve deployment workflows, automation, configuration management, secrets management, and infrastructure-as-code practices
Partner with ML engineers, researchers, and software engineers to understand workload requirements and translate them into practical infrastructure designs
Evaluate tradeoffs between managed cloud services, self-managed Kubernetes, HPC schedulers, bare-metal deployments, and multi-cloud architectures
Build tooling, documentation, runbooks, and operational practices that help the team move quickly without making infrastructure fragile or opaque
Balance speed and robustness, knowing when to prototype quickly and when to harden systems for long-term use

AI & HPC Infrastructure Engineer
