Arlo Technologies, Inc. builds innovative security technology. The company is seeking a Staff Software Engineer to join the Nexus team and build and expand agent capabilities for its next-generation chat experience, which integrates with various systems and uses AI-powered agents to enhance user interaction.
Responsibilities:
- Design and ship new agent capabilities for Nexus — new tools, skills, integrations, and conversational flows that meaningfully expand what users can accomplish through chat
- Build and own production-grade Python services (FastAPI, async patterns) that power Nexus's agent runtime, tool execution, and orchestration logic
- Extend our orchestration layer (LangGraph / LangChain or equivalent) with new agent topologies, routing logic, and tool-use patterns
- Design tool-use and function-calling interfaces — including MCP servers — that let Nexus safely interact with Arlo platform APIs, device telemetry, and partner systems
- Build the evals and observability that make agent behavior measurable: offline test suites, online quality metrics, trace tooling, regression detection, and dashboards engineers and PMs actually use
- Own the testing strategy for AI experiences — design and build the test harnesses, golden datasets, scenario suites, adversarial/red-team tests, and CI gates that catch agent regressions before they reach users. Define what "good" looks like for conversational quality, tool-use correctness, and task completion
- Partner closely with product, design, and platform teams to turn user needs into shipped agent features — and bring engineering judgment to scoping, sequencing, and tradeoffs
- Set technical direction for agent development practices at Arlo: patterns, frameworks, code review standards, and the playbook other engineers follow when they build on Nexus
- Mentor mid-level and senior engineers on LLM systems, prompt design, and production AI engineering
Requirements:
- 8+ years of software engineering experience, with at least 1-2 years building production LLM-powered systems — ideally agentic chat, copilots, or multi-step agent workflows
- Strong production Python — FastAPI, asyncio, type hints, testing discipline. You've built and operated Python services at meaningful scale
- Hands-on experience with LLM orchestration frameworks like LangGraph, LangChain, LlamaIndex, or equivalent — and an opinion on when to use them vs. build your own
- Deep familiarity with tool-use / function-calling patterns. Bonus if you've built or integrated MCP (Model Context Protocol) servers, but strong tool-use experience in any framework translates
- Experience designing multi-agent or multi-step workflows: planner/executor patterns, agent handoff, state management, error recovery, human-in-the-loop
- A real point of view on evals and observability for LLM systems — you've built (or fought to build) the feedback loops that keep agents from regressing in production
- Hands-on experience testing AI/LLM experiences in production — building eval datasets, scoring rubrics (LLM-as-judge, human-in-the-loop, deterministic checks), regression suites, and the discipline to know which one applies when. You understand why traditional unit tests aren't enough for non-deterministic systems and have built the testing patterns that fill the gap
- Track record of shipping at the Staff level — you've operated as a technical leader across teams, not just an individual contributor with a senior title. The bar is delivery and influence, not slide decks
- Experience with RAG, vector databases, embedding pipelines, and retrieval quality tuning
- Familiarity with Anthropic's Claude API, OpenAI's Responses API, or comparable provider SDKs at the level of tool use, structured outputs, and streaming
- Experience instrumenting LLM systems with tools like LangSmith, Langfuse, Arize, Braintrust, or homegrown tracing
- Experience with AI testing tooling (Braintrust, Langfuse, Patronus, DeepEval, Promptfoo, or equivalent), or having built homegrown versions of these
- Familiarity with red-teaming, prompt injection testing, or adversarial evaluation of agent systems
- Experience building backend systems for IoT or connected devices — reasoning about device state, telemetry streams, intermittent connectivity, command/response patterns, and the kind of real-world messiness that doesn't show up in pure SaaS backends. Bonus if you've designed APIs or agents that operate over a fleet of devices
- Experience working with mobile clients (iOS / Android) as API consumers of an agent backend
- Prior work on prompt engineering at scale, including prompt versioning, A/B testing, and prompt regression frameworks