ExpandIQ is a leading PE-backed SaaS platform serving a highly regulated industry at scale, now transforming its software development practice through AI agents. The AgenticOps Engineer will own the operational layer for a fleet of AI agents, ensuring reliability and security across the software delivery lifecycle.
Responsibilities:
- Build and maintain the end-to-end platform orchestrating the agent fleet: task intake, routing, sandboxed execution, automated validation gates, output submission, and feedback loops
- Select, configure, and tune agents for the task types, languages, and codebases they are assigned to; evaluate new agents and model versions as the market evolves
- Build and maintain automated gates before any human review: test passage, coverage thresholds, style compliance, security scanning, and build integrity; own evaluation harnesses and regression suites for agent workflows
- Partner with engineering leads to define what "agent-ready" means for each SDLC phase; shape the intake process and drive the organization toward self-service as the practice matures
- Own the dashboards, logging, alerting, and analytics that provide visibility into agent behavior, performance, cost, and outcomes across the fleet; surface degradation before teams feel it
- Monitor and optimize LLM spend and compute; track cost per unit of work produced — dollars per merged PR, per generated test suite, per validated deployment — and drive it down
- Enforce agent access controls, data handling policies, and audit trail requirements; ensure every agent-produced artifact is traceable end-to-end
- Serve as the on-call specialist when engineers hit persistent walls with agent output; diagnose root cause, pair on fixes, and roll learnings back into shared configuration and documentation
Requirements:
- 4+ years of software engineering experience with strong fundamentals in systems thinking and debugging
- Hands-on, current experience building with LLM APIs — prompt design, tool use, function calling, context management
- Demonstrated ability to diagnose and resolve complex cross-cutting technical issues across multiple teams and systems
- Strong analytical skills — comfortable building dashboards, writing queries, and reasoning about statistical patterns in non-deterministic system output
- Working knowledge of secure software development practices — access control, audit logging, sensitive data handling in automated pipelines
- Excellent written and verbal communication — this role lives on documentation, cross-team clarity, and knowledge transfer
- Experience with prompt evaluation frameworks and LLM observability tooling — LangSmith, Braintrust, Humanloop, or equivalent
- Background in developer tooling, platform engineering, or SRE/DevOps with reliability principles applied to non-deterministic systems
- Familiarity with multiple LLM providers and coding agents such as Claude Code, Codex, or Devin
- Hands-on experience with Kubernetes, Helm, AWS EKS, Terraform, and GitLab CI
- Familiarity with the Model Context Protocol (MCP), including servers, clients, tools, and resource exposure
- Exposure to SOC 2, ISO 27001, or similar compliance frameworks and producing audit evidence for automated systems
- Experience working cross-functionally across multiple product teams without direct authority