architect high-performance computing solutions from scratch
design and optimize all aspects (compute, memory, networking, storage) for a lower cost of ownership
responsible for designing HPC infrastructure solutions, including compute, networking, storage, and workload management components
work closely with cross-functional teams, including hardware, software, product management, and business stakeholders
create and maintain detailed system architecture diagrams and specifications
evaluate and select appropriate hardware and software components for HPC environments
install, configure, and maintain HPC systems, including hardware, software, and networking components
develop and implement automation scripts for system management and deployment
act as the subject matter expert in the HPC domain to unblock dependent teams
develop system benchmarks, profile systems to identify bottlenecks, and optimize workflows and processes to reduce cost of ownership
identify and mitigate technical risks and issues throughout the HPC development life cycle
ensure that the compute cluster is resilient, reliable, and maintainable
stay abreast of the latest HPC technologies, including hardware, software, and networking solutions
focus on understanding the compute workload and designing an HPC cluster with the right combination of nodes, CPUs/GPUs, memory, interconnects, and storage to achieve optimum performance at minimum cost of ownership
Requirements
In-depth experience with Linux system administration and hardware/software configuration
Strong knowledge of HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, RoCE), and parallel filesystems (Lustre, GPFS, BeeGFS, etc.)
Experience creating and maintaining operating system images with different installation and boot schemes
Strong proficiency with automation tools such as Ansible, Chef, and SaltStack, and with scripting languages (Python and Bash)
Experience creating and maintaining storage solutions with different RAID configurations
Ability to design storage solutions for different IOPS requirements and access patterns (random vs. sequential read/write), and to tune storage and filesystems for better performance
Good knowledge of networking concepts, including IP addressing, routing, protocols, switch configuration for RDMA, VLAN configuration, and network bonding
Good knowledge of virtualization and of hardware and software hypervisors
Good knowledge of containerization technologies such as Docker and Singularity
Experience in software-defined networking and storage
Experience setting up remote management protocols such as IPMI and Redfish
Experience setting up and using monitoring systems such as Prometheus and Grafana
Experience in system profiling and custom tuning for target workloads to achieve higher performance and lower cost of ownership
Very good written and verbal communication skills
Very good at writing technical documentation that serves as manuals for non-experts in the field
Experience in HPC cluster management and workload orchestration software (e.g., Slurm, Torque, LSF)
Experience setting up deep learning training/inference solutions
Experience with private cloud infrastructure such as Kubernetes, OpenStack, and CloudStack
Experience in distributed high-performance computing and parallel programming frameworks
Good knowledge of low-latency, high-throughput data transfer technologies (RDMA over RoCE or InfiniBand)
Tech Stack
Ansible
Chef
Cloud
Docker
Grafana
Kubernetes
Linux
OpenStack
Prometheus
Python
SaltStack
Benefits
supportive work culture that encourages you to learn, develop, and grow your career
commitment to providing programs and support that encourage personal and professional growth