Valiant Harbor International is seeking a Software Development Engineer III – Infrastructure to support the Director’s Office at the Advanced Research Projects Agency for Health (ARPA-H). The role involves managing backend infrastructure for GRACE on Microsoft Azure, contributing to the development of agentic AI systems, and ensuring production quality and observability for workflows.
Responsibilities:
- Manage end-to-end backend infrastructure for GRACE on Microsoft Azure:
  - Azure Functions, Azure API Management, Azure Container Apps, and Azure OpenAI Service
  - Storage, retrieval pipelines, vector databases, and document indexing that power GRACE's internal knowledge search
  - Authentication and identity integration, including ARPA-H Entra ID and application-level access control
- Implement and maintain infrastructure as code for all environments
- Own CI/CD pipelines, deployment automation, and release processes including canary and gradual rollouts
- Own production system fundamentals, including monitoring, alerting, logging, distributed tracing, SLOs, and incident response runbooks
- Manage secrets, API keys, and credential rotation across all integrations with external providers
- Monitor cost efficiency across all LLM providers: track spending, set budgets, build guardrails, and optimize cost-per-query without sacrificing quality
- Manage the backend implementation of MCP, including MCP server hosting, tool registration, versioning, and lifecycle management on Azure
- Implement and evolve A2A communication patterns to enable GRACE agents to interoperate with internal and external systems
- Design and maintain LLM orchestration, routing, and multi-model switching infrastructure across OpenAI GPT, Anthropic Claude, and Google Gemini families
- Build and operate RAG pipelines: document ingestion, chunking, embedding, and semantic search
- Implement robust fallback, retry, circuit-breaker, and graceful degradation patterns for all AI service dependencies
- Manage observability and production quality:
  - Build and maintain end-to-end observability for agentic workflows: latency, throughput, error rates, token usage, and LLM quality metrics
  - Implement LLM evaluation pipelines including safety checks, regression monitoring, and grounding assessment
  - Define and enforce system-level SLOs for availability, response time, and tool call reliability
  - Manage alerting and on-call runbooks
- Collaborate and foster teamwork:
  - Establish and improve coding standards, design review processes, and testing practices
  - Communicate technical decisions in writing and in conversation to both engineers and non-engineers
  - Mentor and guide other engineers
  - Think inventively and consider other perspectives; work backward from the user to understand problems before proposing solutions
- Ensure strict privacy, security, and compliance in all systems, integrations, and data handling
Requirements:
- Bachelor's or Master's in Computer Science, Software Engineering, or related field, or equivalent practical experience
- 7+ years of professional software engineering experience building and operating production systems
- Proven experience in high-velocity environments shipping complex products end-to-end
- Strong proficiency in backend languages (including Python); familiarity with modern backend frameworks and async patterns
- Solid understanding of distributed systems, APIs, data pipelines, and software design patterns
- Hands-on experience on Microsoft Azure: Azure Functions, API Management, Container Apps, and Azure OpenAI Service
- Experience with containerization, CI/CD, and infrastructure as code
- Strong understanding of authentication and identity systems (OAuth2, OIDC, Azure Entra ID or equivalent)
- Demonstrated experience operating production systems (e.g., serving on-call and debugging incidents)
- Excellent communication and team-building skills; focused on making others around them better
- Hands-on experience building and operating MCP servers in production, including tool registration, versioning, and hosting on Azure Functions or equivalent serverless infrastructure
- Experience implementing A2A communication patterns and multi-agent orchestration frameworks
- Significant experience building on top of LLMs in production (tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management)
- Demonstrated attention to cost-per-query, context budgets, and prompt efficiency as first-class engineering concerns
- Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning
- Experience in security-conscious engineering within regulated or government environments
- Previous track record in startup or early-stage environments (0-to-1 product building, comfort with ambiguity, and a high sense of urgency)
- Experience in big tech building customer-facing platforms or developer infrastructure at scale
- Familiarity with vector databases, embedding pipelines, and semantic search infrastructure