Best Egg, now part of Barclays, is a market-leading, tech-enabled financial platform helping people build financial confidence through innovative lending solutions and financial health tools. They are seeking a Lead Software Engineer II for AI Operations to design, ship, and operate production-grade LLM applications, agents, and automations across the business.
Responsibilities:
- Build and ship LLM apps & agents: Deliver internal copilots and customer/agent-facing automations with clear SLAs, rollbacks, and observability from day one
- Own RAG pipelines: Design ingestion, chunking, embeddings, indexing, hybrid search/rerank, and retrieval evaluation; track retriever quality via offline golden sets and online metrics
- AWS infrastructure & orchestration: Design and implement scalable AWS architectures, including Bedrock and its knowledge bases, IAM, secure secrets management, policy enforcement, automated provisioning, and resource-usage governance as core platform capabilities
- Observability & SRE for AI: Add tracing, prompt/agent version lineage, eval dashboards, and regression alerts; establish golden datasets and canary tests
- Guardrails & governance: Enforce PII redaction, safety filters, role-based access, audit logs, and human‑in‑the‑loop review paths to control quality and risk
- CI/CD for AI artifacts: Version and deploy prompts, tools, agents, and retrieval pipelines; support blue/green and shadow deploys with automatic rollback triggers
- Cost & performance: Cut run‑rate spend through caching, truncation, batching, autoscaling, and model routing; establish clear unit economics per workflow
- Developer enablement: Provide templates, SDKs, and high‑quality abstractions that let product teams ship safely without bespoke plumbing; improve developer experience
- Platform integration: Build primarily in Python and Metaflow (Outerbounds); deploy on AWS (Bedrock + core services) and OpenAI; use Cursor in daily workflows; help evaluate and, when appropriate, run on Databricks
- Production posture: Participate in on‑call, author runbooks, and remove single‑thread risk for AI services; drive reliability and resilience practices akin to MLOps
Requirements:
- 5–10 years of professional software engineering experience (or equivalent), including 2+ years building AI/LLM applications
- Portfolio of shipped AI projects (links to code, demos, or case studies)
- Demonstrated passion for relentless exploration of the latest AI models, frameworks, and tooling
- Hands-on experience with some or all of OpenAI, Bedrock, Hugging Face, Ollama, and vLLM
- Practical experience designing and tuning retrieval systems (chunking, embeddings, hybrid search, reranking)
- Comfortable building APIs/services and simple UIs where needed
- Strong fundamentals in Python and modern packaging/testing
- Working knowledge of CI/CD, containers, cloud fundamentals (AWS), and runtime performance tuning
- Experience operating services in production
- Metaflow (Outerbounds) preferred
- Tracing and logging expertise with tools such as Datadog, Dynatrace, or Grafana, where relevant for AI monitoring
- Comfortable optimizing latency/throughput/cost, and implementing guardrails for PII/safety/compliance
- Ability to partner effectively with data scientists, analysts, and engineers
- Track record of promoting best practices and high-leverage abstractions
- Databricks familiarity is a plus
- Fine-tuning or distillation experience
- Kubernetes or FastAPI exposure
- Familiarity with Snowflake or similar warehousing for retrieval sources