Lattice is a people success platform focused on building cultures where employees and companies thrive. The company is seeking a Senior Software Engineer to join its AI Engineering team, responsible for developing evaluation frameworks and agent infrastructure to enhance AI performance across the organization.
Responsibilities:
- Design and ship a robust, end-to-end AI evaluation framework, covering offline evals, production tracing, and human-in-the-loop feedback loops, connected across all of Lattice’s AI use cases
- Define and instrument the metrics that actually matter: agent task completion, hallucination rates, response quality, user engagement, and downstream business outcomes
- Build and maintain evaluation datasets, test harnesses, and automated scoring pipelines to catch regressions before they ship
- Identify and surface the drivers of agent quality improvement, giving the team clear signals on where to invest
- Architect and implement reusable agent infrastructure: multi-turn conversation workflows, recommendation services, LLM DAGs, and standardized agent topology patterns using LangGraph
- Build and scale RAG pipelines and retrieval infrastructure, including vector store management and retrieval quality optimization
- Make principled build vs. buy decisions across LLM providers, agent frameworks, and evaluation tooling, balancing capability, cost, latency, and vendor risk
- Contribute to production AI systems with a strong focus on reliability, observability, and performance, not just prototypes
- Own projects end-to-end: scope them, drive them to completion, and bring in the right people at the right time
- Partner with engineering leads and managers to inform technical direction on agent quality and evaluation strategy; you’ll be expected to hold intelligent, substantive conversations about methodology, not just implementation
- Raise the AI engineering bar across the broader team through code review, documentation, and thoughtful technical debate
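For concreteness, the evaluation-pipeline responsibilities above (evaluation datasets, automated scoring, catching regressions before they ship) might look, in highly simplified form, like the sketch below. All names, the exact-match scorer, and the baseline threshold are illustrative assumptions, not Lattice’s actual stack:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scorer; a production pipeline would layer in
    # LLM-as-judge, semantic similarity, or task-specific rubrics.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(agent: Callable[[str], str],
             dataset: list[EvalCase],
             scorer: Callable[[str, str], float] = exact_match) -> float:
    """Run the agent over an eval dataset and return the mean score."""
    scores = [scorer(agent(case.prompt), case.expected) for case in dataset]
    return sum(scores) / len(scores)

def regression_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Fail CI if the score drops more than `tolerance` below the baseline."""
    return score >= baseline - tolerance

# Hypothetical stub agent standing in for a real LLM-backed system.
dataset = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]
stub_agent = lambda p: {"What is the capital of France?": "Paris",
                        "What is 2 + 2?": "4"}[p]
score = run_eval(stub_agent, dataset)
print(score, regression_gate(score, baseline=0.9))  # → 1.0 True
```

The regression gate is the key design choice: evals only catch regressions before they ship if a score drop actually blocks the merge, rather than landing on a dashboard after the fact.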
Requirements:
- 5+ years of professional software engineering experience with significant time spent on production AI/ML systems
- Deep hands-on experience with LLM-based systems: prompt engineering, RAG pipelines, agent orchestration, evaluation metrics, and model fine-tuning
- Proven ability to work with data and apply statistics, particularly in designing and interpreting experiments
- Proven ability to build and operate agentic AI systems in production: multi-step workflows, multi-agent topologies, and the failure modes that come with them
- Strong command of AI evaluation: you've built eval frameworks before, you know the difference between a good eval and a vanity metric, and you have opinions about it
- Production-grade Python engineering: clean, maintainable, testable code
- Hands-on experience with LangGraph or a comparable agent orchestration framework: you've built real agent workflows, not just followed tutorials
- Experience with LangSmith or comparable LLM observability tooling for tracing, evaluation, and debugging
- You read AI papers and blogs regularly and are a trusted source on AI trends
- Experience with vector databases (Pinecone or similar) and retrieval system design
- Comfort with the AWS ecosystem or other cloud infrastructure (e.g., GCP): Lambdas, queues, and cloud-native architecture
- Familiarity with TypeScript is a plus. Our full-stack engineers use it and cross-pollination is valuable
- Clear eyes: you see problems as they are, not as you'd like them to be. You surface hard truths early and address them directly
- Ship, shipmate, self: you prioritize the product and your teammates. Low ego, high ownership
- You're as comfortable in ambiguity as you are in well-defined problems: early foundations mean you'll encounter both
- Strong technical communication: you can debate evaluation methodology with an AI lead and explain it clearly to an EM in the same afternoon
Nice to have:
- Experience with RLHF, LoRA, or other model adaptation techniques
- Background in traditional ML (supervised/unsupervised, neural networks) and knowing when an LLM is overkill
- Experience with MLOps tooling: MLflow, Datadog, and CI/CD pipelines for model deployment
- Published work, conference talks, or open-source contributions in AI/ML
- Experience in HR tech, people analytics, or other domains where data quality and trust are critical
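As a concrete illustration of the retrieval-quality work the requirements reference, recall@k is one of the standard metrics for judging a retrieval pipeline. A minimal sketch (the function name and document IDs are hypothetical, and a real system would compute this over a labeled query set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Vector store returned d1, d3, d2; the labeled relevant set is {d1, d2}.
ranked = ["d1", "d3", "d2"]
print(recall_at_k(ranked, {"d1", "d2"}, k=2))  # → 0.5
print(recall_at_k(ranked, {"d1", "d2"}, k=3))  # → 1.0
```

Tracking a metric like this across index and embedding changes is what "retrieval quality optimization" typically means in practice: it turns "the RAG answers feel worse" into a measurable regression.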