Own the performance and reliability characteristics of AI systems deployed in customer environments
Design, build, and operate low-latency AI services—including real-time voice and interaction pipelines—as well as large-scale batch processing workflows that execute complex AI workloads reliably
Serve as the escalation point for performance and reliability risk, with veto power over launches that violate production constraints
Stay deeply involved in system design, implementation, and operation, investigating performance bottlenecks, failure modes, and scaling limits across AI pipelines, APIs, orchestration layers, and infrastructure
Design and evolve observability systems—metrics, logs, tracing, alerts—that make AI behavior understandable and actionable in production
Work directly with Forward Deployed AI Engineers, Product Engineers, and Architects to ensure that production constraints meaningfully shape system design
Step in on high-risk or high-impact issues, debug live systems, and harden AI services so they can operate continuously under real-world load
Help turn one-off production solutions into reusable patterns and platform capabilities, raising the overall production bar for Distyl’s AI systems over time
Requirements
3+ years of software engineering experience
Deep Production Engineering Experience: You have built and operated high-scale systems, such as low-latency APIs, streaming pipelines, real-time services, or large batch processing systems, and can reason deeply about performance, throughput, and reliability. Experience with real-time voice systems is a strong plus
Strong Systems and Backend Fundamentals: Write high-quality production code and understand distributed systems concepts such as concurrency, fault tolerance, backpressure, and graceful degradation. You are comfortable optimizing systems under tight latency and throughput constraints
Operational Excellence Mindset: Treat observability, instrumentation, and incident response as first-class concerns. Logging, metrics, tracing, alerting, and on-call readiness are integral to how you design and operate systems
Ownership of AI Systems in Production: Take responsibility for AI systems end-to-end—design, deployment, monitoring, and ongoing health. When something breaks, you care about understanding why, fixing it properly, and preventing recurrence
AI-Native Working Style: Actively use AI tools to debug systems, analyze performance data, explore designs, and automate operational workflows
Tech Stack
Distributed Systems
Benefits
100% covered medical, dental, and vision for employees and dependents
401(k), plus additional perks such as commuter benefits and in‑office lunch
Access to state‑of‑the‑art models, generous usage of modern AI tools, and real‑world business problems
Ownership of high‑impact projects across top enterprises
A mission‑driven, fast‑moving culture that prizes curiosity, pragmatism, and excellence