Troubleshoot incoming support requests in a large-scale HPC environment
Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation
Ensure compute servers are running the correct operating system and configuration
Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency
Collaborate with specialist teams to drive issues to closure
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure
Directly contribute to the overall quality of our next-generation chips and improve their time to market
Requirements
BS in Computer Science or similar degree or equivalent experience
2+ years of experience
Proficiency in administering CentOS/RHEL Linux distributions
Understanding of container technologies like Docker
Proficiency in Python and UNIX shell scripting languages such as Bash
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
Solid understanding of cluster configuration management tools such as Ansible
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals