FirstPrinciples is a research organization building AI infrastructure for discovery in fundamental science. This role involves building and operating the compute foundation for AI-driven scientific discovery, focusing on Kubernetes clusters, Linux systems, and GPU infrastructure.
Responsibilities:
- Design, deploy, and operate Kubernetes infrastructure for AI inference, research, and engineering workloads
- Set up and manage GPU and HPC-style compute environments, including scheduling, utilization, job management, and node-level troubleshooting
- Work with systems such as Kubernetes, Slurm or similar schedulers, container runtimes, GPU drivers and libraries (e.g., CUDA), storage systems, and observability tools
- Build and manage Linux-based compute environments, including provisioning, networking, storage, monitoring, access control, and lifecycle management
- Help architect bare-metal, cloud, and hybrid infrastructure across AWS, GCP, Azure, or equivalent platforms
- Own the reliability and operational health of infrastructure systems, including monitoring, alerting, incident response, capacity planning, and performance tuning
- Improve deployment workflows, automation, configuration management, secrets management, and infrastructure-as-code practices
- Partner with ML engineers, researchers, and software engineers to understand workload requirements and translate them into practical infrastructure designs
- Evaluate tradeoffs between managed cloud services, self-managed Kubernetes, HPC schedulers, bare-metal deployments, and multi-cloud architectures
- Build tooling, documentation, runbooks, and operational practices that help the team move quickly without making infrastructure fragile or opaque
- Balance speed and robustness, knowing when to prototype quickly and when to harden systems for long-term use