IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. The HPC Operations Engineer will provide Tier 2 operational support for the IREN global fleet as part of a 24x7 incident response team, ensuring timely resolution of site- and customer-impacting events.

Responsibilities:

Respond to, triage, and resolve operational incidents as part of a 24x7x365 response team, escalating to Tier 3 product operations when appropriate
Support the deployment and maintenance of HPC clusters, ensuring they operate effectively and maximize availability
Manage HPC software components such as Kubernetes, Slurm, cluster management software, and any infrastructure required to operate the HPC environment
Collaborate with product operations to ensure accurate monitoring and response for our global fleet
Draft comprehensive documentation, including operational procedures and best-practice guidelines
Provide technical leadership and training to other team members, fostering an environment of continuous learning and improvement

Requirements:

Minimum of 3 to 5 years of experience in HPC system architecture, with proven expertise in designing, deploying, and managing HPC clusters
Extensive knowledge of Kubernetes, with a focus on its integration within HPC environments
Hands-on experience with the Slurm workload manager, or similar
Familiarity with HPC management tools and software used for efficient system monitoring and troubleshooting
Proven track record of resolving complex system challenges and enhancing operational performance
Understanding of cloud platforms and their integration into HPC ecosystems
Deep knowledge of network and storage solutions commonly used in HPC setups
A degree or diploma in computer science or engineering, or an equivalent combination of education and experience appropriate to the role
Relevant certifications in Kubernetes, HPC technologies, or system architecture are advantageous
