Own our AWS infrastructure end-to-end and actively shape how it evolves: you'll be building, not just maintaining.
Reduce friction in the deployment pipeline so developers can ship without infrastructure blockers.
Harden systems with intention: lock down IAM roles, container images, and authentication flows in ways that reflect a clear understanding of where the real risks are.
Implement monitoring and alerting that catches production issues before users notice them.
Make deployments faster to roll out, easier to roll back, and less prone to failure.
Lead incident response and post-mortems when necessary.
Make GPU clusters and other infrastructure invisible to the researchers who run workloads on them.
Own CUDA compatibility and driver versions across heterogeneous GPU clusters.
Build standardized SLURM job submission workflows that researchers can use without help.
Package and containerize Python simulation code for reproducible execution.
Monitor job health across utilization, cost, and runtime efficiency.
Requirements
Experience: 5+ years in Platform Engineering, DevOps, or SRE roles.
Production AWS experience: You've built and maintained systems on ECS/EKS, managed multi-account networking (VPCs, security groups), and dealt with real-world infrastructure complexity.
Infrastructure as Code: You've written and maintained Terraform (or Pulumi/CDK) in production, including applying ongoing changes as requirements evolved.
CI/CD: You've improved build pipelines in production (reduced build times, increased reliability, made deployments easier to debug) using GitLab CI, GitHub Actions, or an equivalent.
GPU/HPC experience: You've supported GPU workloads in production environments, including code optimization, CUDA debugging, and job scheduler setup.
Background in scientific computing, research infrastructure, ML platforms, or early-stage startups (especially research computing vendors).
Security & compliance experience: You've implemented auth systems (Auth0/Okta), managed encryption (KMS), or worked on compliance-driven infrastructure. FedRAMP experience is a strong plus.
Exposure to quantum computing SDKs (Qiskit, Cirq, PennyLane) or hybrid classical-quantum workflows is a plus, but not required; genuine interest in quantum computing matters more than prior exposure.