TRM Labs is a company dedicated to building a safer world through AI-powered intelligence solutions. The Senior Software Engineer, ML Infrastructure will design and operate scalable GPU-backed infrastructure that supports TRM's AI systems, collaborating with infrastructure, ML, and product teams to deploy and optimize models effectively.
Responsibilities:
- Design and operate GPU cluster infrastructure: build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users
- Optimize high-throughput inference: implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads
- Enable distributed inference strategies: support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models
- Implement model optimization and compilation workflows: integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost
- Schedule heterogeneous workloads: design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand
- Build observability into ML infrastructure: instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use data to continuously improve performance and reliability
- Partner across engineering teams: work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services
Requirements:
- Bachelor's degree (or equivalent) in Computer Science or a related field
- 5+ years of experience building and operating distributed systems or infrastructure in production environments
- Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
- Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost
- Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum
- Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems
- Familiarity with distributed inference strategies including model parallelism and tensor parallelism
- Experience working with Kubernetes or equivalent orchestration systems in cloud environments
- Adaptable. Goals can change fast. You anticipate change and react quickly
- Autonomous. You own what you work on. You move fast and get things done
- Excellent communication. You communicate complex ideas effectively to both technical and non-technical audiences, verbally and in writing
- Collaborative. You work effectively in a cross-functional team and with people at all levels in an organization
- Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
- CUDA familiarity and experience debugging GPU-related issues are a plus