Python, PyTorch, Deep Learning, LLM, Large Language Models, Caching, Communication
About this role
Role Overview
Design, implement, and optimize high‑performance inference pipelines for large language models running on GPUs
Profile and tune model execution across the stack, from scheduler design to kernel fusion and everything in between
Design and experiment with memory management strategies that improve memory bandwidth utilization and cache efficiency
Innovate and implement cutting-edge techniques such as speculative decoding, context caching, and FP8/INT4 quantization to push the boundaries of tokens per second per watt
Develop and maintain benchmarking and testing systems that quantify latency, utilization, and efficiency
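To make the benchmarking responsibility concrete, here is a minimal sketch of the kind of harness this involves, using only the standard library; `generate_tokens` is a hypothetical stand-in for a real model's decode loop, and the request counts are illustrative:

```python
import statistics
import time

def generate_tokens(prompt: str, max_new_tokens: int) -> list[str]:
    """Hypothetical stand-in for a real model's decode loop."""
    time.sleep(0.001)  # simulate per-request work
    return ["tok"] * max_new_tokens

def benchmark(num_requests: int = 50, max_new_tokens: int = 128) -> dict:
    """Measure per-request latency percentiles and overall token throughput."""
    latencies = []
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        tokens = generate_tokens("hello", max_new_tokens)
        latencies.append(time.perf_counter() - t0)
        total_tokens += len(tokens)
    wall = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        # statistics.quantiles with n=20 yields 19 cut points; the last is p95
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
        "tokens_per_s": total_tokens / wall,
    }

print(benchmark())
```

A production harness would additionally track GPU utilization and power draw, but the shape — fixed workload in, percentile latencies and tokens/s out — is the same.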
Requirements
Bachelor's, Master's, or higher degree in Computer Engineering, Computer Science, Applied Mathematics, or a related computing-focused field (or equivalent experience)
5+ years of relevant software development experience
Excellent Python programming, software design, and software engineering skills
Experience working with deep learning frameworks like PyTorch and HuggingFace
Experience profiling and debugging performance at all levels: Python runtime, PyTorch internals, and GPU utilization metrics
Awareness of the latest developments in LLM architectures and LLM inference techniques
Proactive and able to work without supervision
Excellent written and oral communication skills in English
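At the Python-runtime level, the profiling experience called for above can be illustrated with the standard library's cProfile; the `tokenize` and `workload` functions here are hypothetical stand-ins for real pipeline stages:

```python
import cProfile
import io
import pstats

def tokenize(text: str) -> list[str]:
    # Hypothetical CPU-bound preprocessing step
    return text.split()

def workload() -> int:
    # Hypothetical hot loop a profiler run would surface
    total = 0
    for _ in range(1000):
        total += len(tokenize("the quick brown fox jumps over the lazy dog"))
    return total

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Report the hottest functions by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same workflow extends downward: torch.profiler for PyTorch internals and vendor tools for GPU utilization.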