General Dynamics Information Technology (GDIT) is a global technology and professional services company that delivers consulting and technology services. GDIT is seeking a Systems Engineer Principal to support the operational availability of High Performance Computing (HPC) clusters for the National Weather Service, ensuring effective service delivery and customer satisfaction.
Responsibilities:
- Lead, manage, and support the day-to-day operations, sustainment, HPC service delivery, and incremental enhancement of two geographically separated HPC clusters that are GDIT contractor-owned and contractor-operated (COCO) and used exclusively for the Weather and Climate Operational Supercomputing System (WCOSS)
- Collaborate with the GDIT WCOSS team as a senior-level HPC functional expert, addressing complex HPC challenges by continuously providing innovative ideas, solutions, and resolutions for customer requests, issues, and efficiency improvements
- Drive and prioritize the use of resources to continuously improve customer satisfaction with GDIT's HPC service delivery and to exceed the contract service-level metrics for uptime, availability, performance, stability, and on-time product delivery
- Use past experience, team collaboration, system management and troubleshooting applications, and ingenuity to support customer operations on systems that range in capacity from 1,000 to 3,000+ nodes, with hundreds of PB of storage per system
Requirements:
- Bachelor of Arts or Bachelor of Science degree
- 8+ years of related experience
- Highly proficient with Linux (Rocky Linux, SLES, etc.)
- Scripting in Python, Perl, or Bash
- Networking concepts and technologies such as Ethernet, InfiniBand, Slingshot, TCP/IP, basic routing, and network services
- Programming in Python, C/C++, or Fortran
- Administering PBS Pro, Slurm, or other batch systems in an HPC cluster
- System performance monitoring and tuning in an HPC cluster environment (e.g., OpenSearch, Grafana, Prometheus)
- Must complete a satisfactory background investigation
- US citizenship required
- Expected to perform as an individual SME contributor, functional lead, or project/task leader responsible for work product delivery
- Extensive experience troubleshooting, diagnosing, and repairing server hardware failures down to the component level
- Coordinating with vendors to resolve hardware and software problems
- Minimal travel required for onsite work, team collaboration, training, and customer interaction