PulseRise Technologies is seeking a highly skilled Systems Engineer specializing in High Performance Computing (HPC) to support, maintain, and optimize its HPC infrastructure. The ideal candidate will have deep technical expertise and hands-on experience with HPC environments, with a focus on performance engineering and systems operations.
Responsibilities:
- Incident Management: Respond to, diagnose, and resolve HPC-related incidents to ensure system stability and minimize downtime
- Service Request Management: Process and fulfill service requests related to HPC resources, tooling, and services
- Troubleshooting: Investigate and resolve complex technical issues across HPC clusters, applications, networking, and performance workflows
- Testing & Validation: Develop, execute, and document test plans to validate system reliability, scalability, and performance
- Documentation: Create and maintain detailed documentation on system architecture, configurations, workflows, and optimizations
- Manage, monitor, and optimize HPC clusters, job scheduling systems, and related infrastructure
- Analyze performance bottlenecks and apply optimization techniques across compute, memory, and networking layers
- Support software development, integration, and deployment workflows within HPC environments
Requirements:
- Minimum 3 years of experience in software development and/or systems engineering with a strong focus on HPC environments
- Expertise in Linux operating systems, specifically Red Hat Enterprise Linux (RHEL)
- Strong programming and scripting skills (C, C++, Python, Bash) and experience with automation using Ansible
- Hands-on experience with parallel computing frameworks: MPI, OpenMP, CUDA
- Solid knowledge of computer architecture, performance tuning, and system optimization
- Experience managing HPC clusters, including job schedulers (e.g., Slurm, PBS, LSF)
- Strong networking knowledge, particularly InfiniBand
- Understanding of ITIL best practices, especially Incident Management, Service Management, and Process Optimization
- Strong analytical and problem-solving capabilities
- Ability to work in distributed, remote teams
- Clear communication and documentation skills
- Proactive, structured, and solution-oriented mindset
- German language skills are a plus