Lead day-to-day operational support for cluster, storage, HPC, and cloud infrastructures.
Build and test cloud, HPC, and AI platforms, both in-house and onsite, and deploy them to meet customer requirements.
Document complex test and troubleshooting procedures for server, network, and cluster software and hardware.
Troubleshoot hardware and software issues. Provide fixes in a timely manner.
Deploy cluster and storage infrastructures and run the corresponding tests.
Conduct tests and benchmarks on server hardware, storage, networks, applications, and HPC and machine learning workflows.
Collect, visualize, and analyze test and benchmark results.
Write technical documentation, including test reports and standard operating procedures (SOPs).
Coordinate with cross-functional teams to ensure smooth workflow and timely project delivery.
Manage project budgets, control costs, and drive continuous improvement initiatives.
Maintain clear communication with stakeholders and address customer feedback for future enhancements.
Requirements
Bachelor's degree in Computer Science or equivalent work experience preferred.
8+ years of proven experience in HPC/AI or cloud/network management.
Strong leadership skills with the ability to motivate and manage teams effectively.
In-depth knowledge of Cloud/HPC deployment and testing.
Excellent project management skills, including the ability to prioritize tasks, manage multiple projects simultaneously, and meet deadlines.
Strong problem-solving and decision-making abilities, with a proactive approach to identifying and resolving issues.
Excellent communication skills, both verbal and written, with the ability to collaborate and build strong relationships with stakeholders at all levels.
CCNA, AWS, COA, or RHCE certification is a plus.
Positive attitude, desire to learn, time management, and strong interpersonal skills are a plus!