Voltage Park is your enterprise AI factory, offering scalable compute power and bare metal AI infrastructure. We are seeking a highly skilled Infrastructure Operations Engineer to ensure the stability and performance of our compute, storage, and platform infrastructure, supporting AI/ML training and HPC workloads at scale.
Responsibilities:
- At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features
- Deploy updates and improvements to support both Voltage Park’s internal use cases and those of end customers
- Collaborate with colleagues on the Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development teams
- Participate in the on-call rotation, which is evenly distributed across all team members in a primary/secondary pattern: you serve as primary, then move to the secondary position
Requirements:
- 8+ years working with Linux as a server/hosting platform; extra points for Ubuntu experience
- 5+ years of experience with AWS
- 2+ years of experience with Kubernetes and strong container fundamentals
- 2+ years of experience with Terraform and Ansible
- 2+ years of experience managing network-attached storage (via NFS, Ceph, or other protocols); extra points for experience with VAST storage systems
- Experience working in a Slack-first, asynchronous remote work environment
- Experience with monitoring systems (Prometheus, ELK stack)
- Familiarity with GitOps workflows
- Software development experience using Python, Go, Bash, or other languages to automate and connect systems and APIs
- Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand
- Experience building and delivering complex systems
- Effective at navigating tradeoffs between design, risk, cost, and outcomes
- Comfortable navigating ambiguity
- Strong written and oral communication
- Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware
- Experience with GPU servers, either on bare metal or under virtualization
- Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks equipment
- Experience with VAST storage systems