Crusoe is on a mission to accelerate the abundance of energy and intelligence through sustainable technology. As a Senior Cloud Support Engineer, you'll empower customers to leverage Crusoe Cloud's low-cost GPU compute power, providing exceptional technical support and ensuring seamless utilization of the technology for groundbreaking advancements.
Responsibilities:
- Provide exceptional technical support to customers via Zendesk, meeting SLAs and maintaining high CSAT (95%+)
- Participate in a 24/7 on-call rotation to ensure timely resolution of critical issues
- Diagnose and resolve issues related to VMs, hardware failures, and scaling tests using CLI and internal tools
- Manage alert triage, prepare for maintenance windows, and conduct node delivery testing
- Work closely with SRE, Networking, and Storage teams from initial triage to root cause analysis (RCA) delivery
- Adhere to global team collaboration and handoff processes for ticketing and on-call procedures
- Develop onboarding/training materials, knowledge base documentation, and standard operating procedures (SOPs)
Requirements:
- Bachelor's degree in IT, Computer Science, Engineering, or a related field, or 4+ years of equivalent technical experience
- Strong command-line interface (CLI) skills in Linux environments
- Proficiency with Git for code management and collaboration
- 5+ years of experience in a customer support role, ideally within cloud, storage, or networking environments
- Experience with container orchestration (e.g., Kubernetes), workload management (e.g., Slurm), infrastructure as code (e.g., Terraform), and monitoring tools (e.g., Grafana)
- Familiarity with other public cloud platforms (e.g., AWS, Azure, GCP)
- Excellent communication and customer service skills, including the ability to prioritize competing escalations
- Understanding of HPC technologies such as InfiniBand, RDMA, RoCE, and Software Defined Networking (SDN)
- Relevant certifications, such as CKA, CKAD, CKS, KCNA; AWS Machine Learning - Specialty, Data Analytics - Specialty, Solutions Architect - Professional, Developer - Associate; NVIDIA AI Infrastructure and Operations, Generative AI and LLMs, Generative AI Multi-modal, InfiniBand; Linux Foundation IT Associate, System Administrator
- Deep understanding of specific cloud platforms and services
- Experience with automation tools and scripting languages
- Demonstrated ability to analyze complex technical issues and develop effective solutions
- Proven ability to mentor, train, and onboard colleagues
- A strong interest in contributing to a more sustainable future through technology