Vero is an exciting AI infrastructure startup working in partnership with NVIDIA and other key organizations shaping the future of data centers and AI infrastructure. As a Senior Storage Engineer, you will operate, optimize, and scale distributed storage systems for advanced AI infrastructure, ensuring performance and reliability for large-scale GPU workloads.
Responsibilities:
- Operate and support production storage platforms powering large-scale AI workloads, including ETL pipelines
- Maintain performance, stability, and reliability across customer environments
- Monitor and tune storage systems to ensure predictable throughput and low latency
- Troubleshoot end-to-end I/O issues across GPU clients, RDMA networks (InfiniBand or RoCE), and storage infrastructure
- Plan and execute upgrades, expansions, and maintenance with minimal disruption
- Support customer onboarding, including storage configuration, namespaces, and access controls
- Run performance validation and benchmarking
- Own incidents, lead root cause analysis, and improve reliability through automation and documentation
Requirements:
- Strong Linux systems experience operating storage infrastructure in production environments
- Hands-on experience with high-performance or distributed storage systems supporting large-scale AI or HPC clusters
- Deep understanding of storage architectures, including parallel file systems and file, object, and block storage (e.g., Lustre, VAST, DDN)
- Experience troubleshooting end-to-end I/O performance across clients, RDMA networks (InfiniBand or RoCE), and storage systems
- Experience analyzing, benchmarking, and optimizing storage performance, with a working knowledge of reliability and data protection concepts
- Experience with ETL and integrations supporting AI/ML workloads