Best Egg, now part of Barclays, is a market-leading, tech-enabled financial platform helping people build financial confidence through innovative lending solutions and financial health tools. They are seeking a Lead Software Engineer II for AI Operations to design, ship, and operate production-grade LLM applications, agents, and automations across the business.

Responsibilities:

Build and ship LLM apps & agents: Deliver internal copilots and customer/agent-facing automations with clear SLAs, rollbacks, and observability from day one
Own RAG pipelines: Design ingestion, chunking, embeddings, indexing, hybrid search/rerank, and retrieval evaluation; track retriever quality via offline golden sets and online metrics
AWS Infrastructure & Orchestration: Design and implement scalable AWS architectures, including AWS AI features such as Bedrock and knowledge bases, with IAM, secure secrets and policy enforcement, automated provisioning, and resource-usage governance as core platform capabilities
Observability & SRE for AI: Add tracing, prompt/agent version lineage, eval dashboards, and regression alerts; establish golden datasets and canary tests
Guardrails & governance: Enforce PII redaction, safety filters, role-based access, audit logs, and human‑in‑the‑loop review paths to control quality and risk
CI/CD for AI artifacts: Version and deploy prompts, tools, agents, and retrieval pipelines; support blue/green and shadow deploys with automatic rollback triggers
Cost & performance: Cut run‑rate spend through caching, truncation, batching, autoscaling, and model routing; establish clear unit economics per workflow
Developer enablement: Provide templates, SDKs, and high‑quality abstractions that let product teams ship safely without bespoke plumbing; improve developer experience
Platform integration: Build primarily in Python and Metaflow (Outerbounds); deploy on AWS (Bedrock + core services) and OpenAI; use Cursor in daily workflows; help evaluate and, when appropriate, run on Databricks
Production posture: Participate in on‑call, author runbooks, and remove single‑thread risk for AI services; drive reliability and resilience akin to MLOps

Requirements:

5–10 years of professional software engineering (or equivalent) with 2+ years building AI/LLM applications
Portfolio of shipped AI projects (links to code, demos, or case studies)
Demonstrated passion for relentless exploration of the latest AI models, frameworks, and tooling
Hands-on experience with some or all of OpenAI, Bedrock, Hugging Face, Ollama, and vLLM
Practical experience designing and tuning retrieval systems (chunking, embeddings, hybrid search, reranking)
Comfortable building APIs/services and simple UIs where needed
Strong fundamentals in Python and modern packaging/testing
Experience with CI/CD, containers, cloud fundamentals (AWS), and runtime performance tuning
Experience operating services in production
Metaflow (Outerbounds) preferred
Tracing and logging expertise with tools like Datadog, Dynatrace, or Grafana, where relevant for AI monitoring
Comfortable optimizing latency/throughput/cost, and implementing guardrails for PII/safety/compliance
Ability to partner effectively with data scientists, analysts, and engineers
Track record of promoting best practices and high-leverage abstractions
Databricks familiarity is a plus
Fine-tuning or distillation experience
Kubernetes or FastAPI exposure
Familiarity with Snowflake or similar warehousing for retrieval sources
