IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. The HPC Operations Engineer will provide Tier 2 operational support for the IREN global fleet as part of a 24x7 incident response team, ensuring timely resolution of site- and customer-impacting events.
Responsibilities:
- Respond to, triage, and resolve operational incidents as part of a 24x7x365 response team, and support escalations to Tier 3 product operations when appropriate
- Support the deployment and maintenance of HPC clusters, ensuring they operate effectively and maximize availability
- Manage HPC software components such as Kubernetes, Slurm, cluster management software, and any infrastructure required to operate the HPC environment
- Collaborate with product operations to ensure accurate monitoring and response for our global fleet
- Draft comprehensive documentation, including operational procedures and best-practice guidelines
- Provide technical leadership and training to other team members, fostering an environment of continuous learning and improvement
Requirements:
- 3 to 5 years of experience in HPC system architecture, with proven expertise in designing, deploying, and managing HPC clusters
- Extensive knowledge of Kubernetes, with a focus on its integration within HPC environments
- Hands-on experience with the Slurm workload manager, or similar
- Familiarity with HPC management tools and software used for system monitoring and troubleshooting
- Proven track record of resolving complex system challenges and enhancing operational performance
- Understanding of cloud platforms and their integration into HPC ecosystems
- Deep knowledge of network and storage solutions commonly used in HPC setups
- A degree or diploma in computer science or engineering, or an equivalent combination of education and experience appropriate to the role
- Relevant certifications in Kubernetes, HPC technologies, or system architecture are an advantage