Vero is an exciting AI infrastructure startup that collaborates closely with NVIDIA and other key organizations in the field. The Senior Storage Engineer will operate, optimize, and scale distributed storage systems to support advanced AI workloads, ensuring performance and reliability for large-scale GPU operations.
Responsibilities:
- Operate and support production storage platforms powering large-scale AI workloads
- Maintain performance, stability, and reliability across customer environments
- Monitor and tune storage systems to ensure predictable throughput and low latency
- Troubleshoot end-to-end I/O issues across GPU clients, RDMA networks (InfiniBand or RoCE), and storage infrastructure
- Plan and execute upgrades, expansions, and maintenance with minimal disruption
- Support customer onboarding, including storage configuration, namespaces, and access controls
- Run performance validation and benchmarking
- Own incidents, lead root cause analysis, and improve reliability through automation and documentation
Requirements:
- Strong Linux systems experience operating storage infrastructure in production environments
- Hands-on experience with high-performance or distributed storage systems supporting large-scale AI and HPC clusters
- Deep understanding of storage architectures, including parallel file systems and file, object, and block storage (e.g., VAST, DDN, Weka, Lustre)
- Experience troubleshooting end-to-end I/O performance across clients, RDMA networks (InfiniBand or RoCE), and storage systems
- Experience analyzing and optimizing storage performance through benchmarking, with a working knowledge of reliability and data protection concepts
- Experience with ETL and integrations supporting AI/ML workloads