My name is Boopathi, and I'm a Senior Technical Recruiter at Cloud Destinations LLC.
Please see the job description below and let me know if you are interested in this position.
HPC AI Platform Engineer
Remote
6 Months
Position Overview:
The Senior HPC and AI Platform Engineer will lead the engineering and operations of scalable high-performance computing and artificial intelligence platforms. This role plays a critical part in enabling advanced research and scientific outcomes by delivering reliable, high-performing infrastructure across on-premises and public cloud environments.
Responsibilities:
- Lead the engineering, design, build, and ongoing operation of scalable HPC and AI platforms.
- Enable HPC and AI infrastructure and user experiences to support research and scientific workloads.
- Collaborate closely with researchers and scientists to optimize system performance and streamline workflows.
- Leverage tooling and automation for orchestration, resource scheduling, data access, and reproducibility.
- Evolve and operate public cloud and on-premises environments with a focus on availability and performance for HPC and AI workloads.
- Define, monitor, and analyze infrastructure metrics and resource utilization.
- Partner with Linux engineering teams in areas of automation, AIOps, and observability.
Qualifications:
- Bachelor's degree in computer science, information technology, or a related technical field.
- Five or more years of experience as an HPC platform engineer.
- Experience leading global, large-scale infrastructure initiatives.
- Authorized to work in the United States on a full-time basis without employer sponsorship.
Tools and Technologies:
- Hands-on experience with HPC platforms and accelerators such as GPUs.
- Experience with HPC schedulers such as Altair Grid Engine and Slurm.
- Kubernetes platforms and container technologies including Docker and Apptainer.
- HPC and AI workloads, infrastructure, and cluster architectures.
- Distributed and parallel file systems such as WEKA.
- Advanced Linux command line usage, troubleshooting, and HPC administration.
- DevOps and infrastructure automation tools including GitHub, Chef, Ansible, and Terraform.
- Strong scripting and programming skills using Python or Bash.
- Passion for continuous learning and staying current with HPC and AI infrastructure trends.