Peach Pilot transforms how businesses run by building a platform that connects operations through AI. The Senior Quality Assurance Engineer will establish testing frameworks and ensure the quality of AI-generated outputs before they reach clients, playing a crucial role in maintaining trust in, and the reliability of, the company's deliverables.
Responsibilities:
- Establish the testing framework: unit, integration, end-to-end, and AI-specific evaluation pipelines using Playwright and Vitest
- Define quality standards, test coverage requirements, and documentation practices in partnership with the Lead Engineer
- Audit the existing platform and identify the highest-risk surfaces before the next client deployment
- Design evaluation frameworks for non-deterministic LLM outputs — including prompt regression testing, model drift detection, and output quality scoring
- Build automated test suites for the agent orchestration layer, including governance-agent audit-trail integrity and human-override behavior
- Validate the Company Brain (Memgraph + Qdrant) for data accuracy, retrieval quality, and failure modes under real enterprise data, including entity resolution across systems and temporal data patterns
- Test the Analysis Engine pipeline that surfaces Company X-Ray findings, ensuring insights are not just technically accurate but reliable enough to present to a client
- Own end-to-end testing of the data ingestion pipelines that connect to client systems (CRM, email, calls, calendars, documents, financial systems) through Nango's 700+ connector integration layer
- Test multi-model routing logic to confirm cost-optimized task allocation behaves correctly across LLM providers via LiteLLM
- Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow
- Own file ingestion pipeline testing (Word, Excel, PowerPoint, PDF), including encryption, formatting edge cases, and audit-trail continuity
Requirements:
- 7+ years of QA engineering experience, with at least 3 years in a senior or lead capacity where you shaped process and standards, not just executed them
- You have tested AI/LLM-powered applications. You understand prompt sensitivity, output variance, and how to build eval pipelines that catch regressions across model updates
- You speak in ownership: you've built the eval pipeline, owned model quality, gated the release — not just run someone else's test suite
- You write test code. Python is your primary tool. You have built and maintained CI/CD-integrated test suites, and you don't wait for someone to file a bug to find one
- Hands-on experience with Playwright and Vitest in a production environment; you've built automation frameworks from scratch, not just inherited them
- Comfortable testing complex API chains, async/streaming responses, and multi-service workflows. Data pipelines and knowledge graph outputs don't intimidate you
- You test for confusion and trust failure, not just broken functionality. Your end users are non-technical executives, and you advocate for them
- US-based, able to overlap roughly 5 hours per day with EDT, and available for full-time contract hours
- You have experience with LLM evaluation frameworks (e.g., LangSmith, DeepEval, Promptfoo, RAGAS, or custom eval pipelines)
- You have tested agent frameworks or orchestration layers in a production environment
- You have a background in a regulated industry (insurance, finance, healthcare) where audit-trail integrity is non-negotiable
- You have worked alongside Forward Deployed or solutions engineering teams and understand field deployment risk