Role Overview
- Design, develop, and maintain comprehensive benchmarking frameworks spanning OS, kernel, and application layers.
- Profile workloads across CPU, memory, I/O, network, and accelerator (GPU/NPU) subsystems to identify bottlenecks and optimization opportunities.
- Establish and own performance baselines across CIQ's product and solutions portfolio.
- Leverage AI-assisted tooling and agentic workflows to accelerate profiling, analysis, and root cause identification.
- Build and maintain automated performance regression-detection pipelines integrated into CI/CD workflows using Fuzzball.
- Identify, triage, and resolve regressions across user space, kernel space, and application layers with urgency and rigor.
- Collaborate across engineering teams to root-cause regressions introduced by upstream kernel changes, compiler updates, or library modifications.
- Drive proactive performance improvements, not just reactive fixes, to keep CIQ solutions ahead of the competition across every layer of the stack.
- Own core operating system performance: kernel subsystem tuning (scheduler, memory management, I/O, networking), system call overhead reduction, and user space library and runtime optimizations.
- Identify and implement kernel-level enhancements, including patches, configuration changes, and upstream contributions that yield measurable performance gains for CIQ's customer workloads.
- Optimize for AI inference and training workloads, including LLM serving, model parallelism, and accelerator utilization.
- Tune performance for HPC workloads, including modeling, simulation, and tightly coupled parallel applications (MPI, OpenMP, etc.).
- Optimize general computing and service workloads: web services, databases, messaging systems, and other production software that runs on CIQ's OS platform.
- Work at all levels of the stack: compiler flags, kernel parameters, scheduler tuning, NUMA topology, memory allocation, and application-level algorithmic improvements.
- Champion an AI-first engineering philosophy: use AI tools, agents, and automation to accelerate your own productivity and improve the quality of performance insights.
- Identify and prioritize optimization opportunities that directly impact AI training throughput and inference latency/cost.
- Stay current on state-of-the-art techniques in ML system performance, including quantization, batching strategies, kernel fusion, and hardware-software co-design.
- Develop deep expertise in CIQ's Fuzzball platform: its architecture, scheduling, and workload execution model.
- Integrate performance benchmarks, regression tests, and user-facing workloads into Fuzzball-based pipelines.
- Contribute to the performance characterization of Fuzzball itself, ensuring the platform adds minimal overhead and scales efficiently.
- Develop broad familiarity with the full CIQ product portfolio, including Rocky Linux and RLC (and its variants), Fuzzball, Apptainer (formerly Singularity), and Warewulf, understanding how performance considerations span and interconnect across each.
- Collaborate deeply with the engineering teams behind each product line to surface, prioritize, and deliver performance improvements that benefit customers across the entire CIQ ecosystem.
- Partner with product and customer success teams to translate real-world performance pain points into engineering priorities and measurable outcomes.
- Document and communicate findings clearly, from low-level profiling data to executive-level summaries.
- Contribute to technical publications, conference presentations, and thought leadership that reinforces CIQ's reputation for performance excellence.
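The automated regression-detection pipelines described above can be sketched as a simple benchmark gate. This is a minimal illustration only; the helper name, the sample latencies, and the 5% threshold are assumptions for the example, not CIQ's actual tooling or policy.

```python
import statistics

def regression_verdict(baseline, current, threshold_pct=5.0):
    """Flag a regression when the current run's mean latency exceeds
    the baseline mean by more than threshold_pct percent.
    (Illustrative helper; name and threshold are assumptions.)"""
    base = statistics.mean(baseline)
    cur = statistics.mean(current)
    delta_pct = (cur - base) / base * 100.0
    verdict = "regression" if delta_pct > threshold_pct else "pass"
    return verdict, round(delta_pct, 2)

# Latency samples (milliseconds) from two hypothetical benchmark runs.
baseline_ms = [10.1, 9.9, 10.0, 10.2, 9.8]
current_ms = [11.0, 10.8, 11.2, 10.9, 11.1]

verdict, delta = regression_verdict(baseline_ms, current_ms)
print(verdict, delta)  # the 10% slowdown trips the 5% gate
```

In a CI/CD context, a gate like this would run after each benchmark job and fail the pipeline on a "regression" verdict; a production version would also account for run-to-run noise (e.g. variance or confidence intervals) rather than comparing raw means.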
Requirements
- A deep, principled understanding of operating system internals: the Linux kernel scheduler, memory subsystem, I/O stack, and networking.
- Proven experience identifying and resolving performance regressions across kernel and user space in production environments.
- Hands-on expertise with profiling and tracing tools: perf, eBPF/bpftrace, flame graphs, VTune, Nsight, strace, ftrace, and similar.
- Strong background in AI/ML workload performance, including inference optimization (TensorRT, ONNX, vLLM, or similar), training efficiency, and GPU/accelerator utilization.
- Experience with HPC workloads: MPI, OpenMP, parallel filesystems, RDMA/InfiniBand, and job schedulers (Slurm, PBS, etc.).
- Familiarity with modern AI-first development workflows and comfort using LLM-based tools to accelerate engineering work.
- Experience building automated performance testing and regression detection pipelines in CI/CD environments.
- Excellent analytical skills: able to form hypotheses, design experiments, and draw actionable conclusions from complex data.
- Strong written and verbal communication skills; able to present findings to both deeply technical audiences and business stakeholders.
- A collaborative, humble, always-learning mindset, combined with the confidence to champion performance as a first-class engineering concern.
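As a minimal illustration of the hypothesis-driven profiling this role calls for, the sketch below uses Python's built-in cProfile to surface a deliberately slow code path. The function names and workload are invented for the example; in practice this role would reach for perf, eBPF/bpftrace, or VTune against native workloads.

```python
import cProfile
import pstats

def slow_path(n):
    # Deliberately quadratic work: the hotspot we expect the profiler to surface.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def fast_path(n):
    return sum(range(n))

def workload():
    slow_path(300)
    fast_path(300)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# pstats stores per-function data as (call_count, ncalls, tottime, cumtime, callers).
# Ranking by tottime (index 2) isolates self-time, so the quadratic loop wins
# even though workload() has a larger cumulative time.
stats = pstats.Stats(profiler)
hotspot = max(stats.stats.items(), key=lambda kv: kv[1][2])[0][2]
print(hotspot)
```

The same workflow — measure, rank by self-time, fix the top offender, re-measure — carries over directly to perf report or flame graphs on kernel and native code.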
Benefits
- Medical, dental, and vision insurance.
- Flexible paid time off.
- Employee stock options.
- Remote work; no travel required for most positions.