Staff Software Engineer, ML Infrastructure at Decagon | JobVerse
Staff Software Engineer, ML Infrastructure
Decagon
United States
Full Time
5 hours ago
$300,000 - $430,000 USD
No H1B
Key skills
Node.js
Python
PyTorch
ML
LLM
TensorFlow
JAX
About this role
Role Overview
Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
Implement and integrate state-of-the-art training algorithms into production pipelines
Own inference architecture and multi-provider routing, including failover and optimization
Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
Lead initiatives to improve latency and cost efficiency across the training and serving stack
Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
Drive technical direction, mentor engineers, and establish best practices for ML infrastructure
Requirements
8+ years building ML infrastructure or production systems at scale
Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
Proven track record leading complex, multi-quarter technical projects
Tech Stack
Node.js
Python
PyTorch
TensorFlow
Benefits
Medical, dental, and vision benefits
Take-what-you-need vacation policy
Daily lunches, dinners, and snacks in the office to keep you at your best