ExpandIQ is a leading PE-backed SaaS platform serving a highly regulated industry at scale, now transforming its software development practice through AI agents. The AgenticOps Engineer will own the operational layer for a fleet of AI agents, ensuring reliability and security across the software delivery lifecycle.
Responsibilities:
- Build and maintain the end-to-end platform orchestrating the agent fleet: task intake, routing, sandboxed execution, automated validation gates, output submission, and feedback loops
- Select, configure, and tune agents for the task types, languages, and codebases they are assigned to; evaluate new agents and model versions as the market evolves
- Build and maintain automated gates before any human review: test passage, coverage thresholds, style compliance, security scanning, and build integrity; own evaluation harnesses and regression suites for agent workflows
- Partner with engineering leads to define what "agent-ready" means for each SDLC phase; shape the intake process and drive the organization toward self-service as the practice matures
- Own the dashboards, logging, alerting, and analytics that provide visibility into agent behavior, performance, cost, and outcomes across the fleet; surface degradation before teams feel it
- Monitor and optimize LLM spend and compute; track cost per unit of work produced — dollars per merged PR, per generated test suite, per validated deployment — and drive it down
- Enforce agent access controls, data handling policies, and audit trail requirements; ensure every agent-produced artifact is traceable end-to-end
- Serve as the on-call specialist when engineers hit persistent walls with agent output; diagnose root cause, pair on fixes, and roll learnings back into shared configuration and documentation
Requirements:
- 4+ years of software engineering experience with strong fundamentals in systems thinking and debugging
- Hands-on, current experience building with LLM APIs — prompt design, tool use, function calling, context management
- Demonstrated ability to diagnose and resolve complex cross-cutting technical issues across multiple teams and systems
- Strong analytical skills — comfortable building dashboards, writing queries, and reasoning about statistical patterns in non-deterministic system output
- Working knowledge of secure software development practices — access control, audit logging, sensitive data handling in automated pipelines
- Excellent written and verbal communication — this role lives on documentation, cross-team clarity, and knowledge transfer
- Experience with prompt evaluation frameworks and LLM observability tooling — LangSmith, Braintrust, Humanloop, or equivalent
- Background in developer tooling, platform engineering, or SRE/DevOps with reliability principles applied to non-deterministic systems
- Familiarity with multiple LLM providers and coding agents such as Claude Code, Codex, or Devin
- Hands-on experience with Kubernetes, Helm, AWS EKS, Terraform, and GitLab CI
- Familiarity with the Model Context Protocol (MCP), including servers, clients, tools, and resource exposure
- Exposure to SOC 2, ISO 27001, or similar compliance frameworks and producing audit evidence for automated systems
- Experience working cross-functionally across multiple product teams without direct authority