Role Overview
- Design, develop, and maintain comprehensive benchmarking frameworks spanning OS, kernel, and application layers.
- Profile workloads across CPU, memory, I/O, network, and accelerator (GPU/NPU) subsystems to identify bottlenecks and optimization opportunities.
- Establish and own performance baselines across CIQ's product and solutions portfolio.
- Leverage AI-assisted tooling and agentic workflows to accelerate profiling, analysis, and root cause identification.
- Build and maintain automated performance regression-detection pipelines integrated into CI/CD workflows using Fuzzball.
- Identify, triage, and resolve regressions across user space, kernel space, and application layers with urgency and rigor.
- Collaborate across engineering teams to root-cause regressions introduced by upstream kernel changes, compiler updates, or library modifications.
- Drive proactive performance improvements, not just reactive fixes, to keep CIQ solutions ahead of the competition across every layer of the stack.
- Own core operating system performance: kernel subsystem tuning (scheduler, memory management, I/O, networking), system call overhead reduction, and user space library and runtime optimizations.
- Identify and implement kernel-level enhancements, including patches, configuration changes, and upstream contributions that yield measurable performance gains for CIQ's customer workloads.
- Optimize for AI inference and training workloads, including LLM serving, model parallelism, and accelerator utilization.
- Tune performance for HPC workloads, including modeling, simulation, and tightly coupled parallel applications (MPI, OpenMP, etc.).
- Optimize general computing and service workloads: web services, databases, messaging systems, and other production software that runs on CIQ's OS platform.
- Work at all levels of the stack: compiler flags, kernel parameters, scheduler tuning, NUMA topology, memory allocation, and application-level algorithmic improvements.
- Champion an AI-first engineering philosophy: use AI tools, agents, and automation to accelerate your own productivity and improve the quality of performance insights.
- Identify and prioritize optimization opportunities that directly impact AI training throughput and inference latency/cost.
- Stay current on state-of-the-art techniques in ML system performance, including quantization, batching strategies, kernel fusion, and hardware-software co-design.
- Develop deep expertise in CIQ's Fuzzball platform: its architecture, scheduling, and workload execution model.
- Integrate performance benchmarks, regression tests, and user-facing workloads into Fuzzball-based pipelines.
- Contribute to the performance characterization of Fuzzball itself, ensuring the platform adds minimal overhead and scales efficiently.
- Develop broad familiarity with the full CIQ product portfolio, including Rocky Linux and RLC (and its variants), Fuzzball, Apptainer (formerly Singularity), and Warewulf, understanding how performance considerations span and interconnect across each.
- Collaborate deeply with the engineering teams behind each product line to surface, prioritize, and deliver performance improvements that benefit customers across the entire CIQ ecosystem.
- Partner with product and customer success teams to translate real-world performance pain points into engineering priorities and measurable outcomes.
- Document and communicate findings clearly, from low-level profiling data to executive-level summaries.
- Contribute to technical publications, conference presentations, and thought leadership that reinforces CIQ's reputation for performance excellence.
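The automated regression-detection pipelines described above can be sketched as a simple benchmark gate. This is a minimal illustration only; the helper name, the sample latencies, and the 5% threshold are assumptions for the example, not CIQ's actual tooling or policy.

```python
import statistics

def regression_verdict(baseline, current, threshold_pct=5.0):
    """Flag a regression when the current run's mean latency exceeds
    the baseline mean by more than threshold_pct percent.
    (Illustrative helper; name and threshold are assumptions.)"""
    base = statistics.mean(baseline)
    cur = statistics.mean(current)
    delta_pct = (cur - base) / base * 100.0
    verdict = "regression" if delta_pct > threshold_pct else "pass"
    return verdict, round(delta_pct, 2)

# Latency samples (milliseconds) from two hypothetical benchmark runs.
baseline_ms = [10.1, 9.9, 10.0, 10.2, 9.8]
current_ms = [11.0, 10.8, 11.2, 10.9, 11.1]

verdict, delta = regression_verdict(baseline_ms, current_ms)
print(verdict, delta)  # the 10% slowdown trips the 5% gate
```

In a CI/CD context, a gate like this would run after each benchmark job and fail the pipeline on a "regression" verdict; a production version would also account for run-to-run noise (e.g. variance or confidence intervals) rather than comparing raw means.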
Requirements
- A deep, principled understanding of operating system internals: the Linux kernel scheduler, memory subsystem, I/O stack, and networking.
- Proven experience identifying and resolving performance regressions across kernel and user space in production environments.
- Hands-on expertise with profiling and tracing tools: perf, eBPF/bpftrace, flame graphs, VTune, Nsight, strace, ftrace, and similar.
- Strong background in AI/ML workload performance, including inference optimization (TensorRT, ONNX, vLLM, or similar), training efficiency, and GPU/accelerator utilization.
- Experience with HPC workloads: MPI, OpenMP, parallel filesystems, RDMA/InfiniBand, and job schedulers (Slurm, PBS, etc.).
- Familiarity with modern AI-first development workflows and comfort using LLM-based tools to accelerate engineering work.
- Experience building automated performance testing and regression detection pipelines in CI/CD environments.
- Excellent analytical skills: able to form hypotheses, design experiments, and draw actionable conclusions from complex data.
- Strong written and verbal communication skills; able to present findings to both deeply technical audiences and business stakeholders.
- A collaborative, humble, always-learning mindset, combined with the confidence to champion performance as a first-class engineering concern.
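As a minimal illustration of the hypothesis-driven profiling this role calls for, the sketch below uses Python's built-in cProfile to surface a deliberately slow code path. The function names and workload are invented for the example; in practice this role would reach for perf, eBPF/bpftrace, or VTune against native workloads.

```python
import cProfile
import pstats

def slow_path(n):
    # Deliberately quadratic work: the hotspot we expect the profiler to surface.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def fast_path(n):
    return sum(range(n))

def workload():
    slow_path(300)
    fast_path(300)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# pstats stores per-function data as (call_count, ncalls, tottime, cumtime, callers).
# Ranking by tottime (index 2) isolates self-time, so the quadratic loop wins
# even though workload() has a larger cumulative time.
stats = pstats.Stats(profiler)
hotspot = max(stats.stats.items(), key=lambda kv: kv[1][2])[0][2]
print(hotspot)
```

The same workflow — measure, rank by self-time, fix the top offender, re-measure — carries over directly to perf report or flame graphs on kernel and native code.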
Benefits
- Medical, dental, and vision insurance.
- Flexible paid time off.
- Employee stock options.
- Remote work; no travel required for most positions.