TRM Labs is a company dedicated to building a safer world through AI-powered intelligence solutions. The Senior Software Engineer, ML Infrastructure will design and operate scalable GPU-backed infrastructure that supports TRM's AI systems, collaborating with various teams to ensure effective model deployment and optimization.

Responsibilities:

Design and operate GPU cluster infrastructure
Build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users
Optimize high-throughput inference
Implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads
Enable distributed inference strategies
Support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models
Implement model optimization and compilation workflows
Integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost
Schedule heterogeneous workloads
Design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand
Build observability into ML infrastructure
Instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use that data to continuously improve performance and reliability
Partner across engineering teams
Work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services

Requirements:

Bachelor's degree (or equivalent) in Computer Science or a related field
5+ years of experience building and operating distributed systems or infrastructure in production environments
Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost
Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum
Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems
Familiarity with distributed inference strategies including model parallelism and tensor parallelism
Experience working with Kubernetes or equivalent orchestration systems in cloud environments
Adaptable. Goals can change fast. You anticipate and react quickly.
Autonomous. You own what you work on. You move fast and get things done.
Excellent communication. You communicate complex ideas effectively to both technical and non-technical audiences, verbally and in writing.
Collaborative. You work effectively in a cross-functional team and with people at all levels in an organization.
Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
CUDA familiarity and experience debugging GPU-related issues are a plus
