Valiant Harbor International is seeking a Software Development Engineer III – Infrastructure to support the Director’s Office at the Advanced Research Projects Agency for Health (ARPA-H). The role involves managing backend infrastructure for GRACE on Microsoft Azure, contributing to the development of agentic AI systems, and ensuring production quality and observability for workflows.

Responsibilities:

Manage end-to-end backend infrastructure for GRACE on Microsoft Azure:
- Azure Functions, Azure API Management, Azure Container Apps, and Azure OpenAI Service
- Manage storage, retrieval pipelines, vector databases, and document indexing that power GRACE's internal knowledge search
- Authentication and identity integration, including ARPA-H Entra ID and application-level access control
- Implement and maintain infrastructure as code for all environments
- Own CI/CD pipelines, deployment automation, and release processes including canary and gradual rollouts
- Be responsible for production system basics (e.g., monitoring, alerting, logging, distributed tracing, SLOs, and incident response runbooks)
- Manage secrets, API keys, and credential rotation across all integrations with external providers
- Monitor cost efficiency across all LLM providers: track spending, set budgets, build guardrails, and optimize for cost-per-query without sacrificing quality
Manage the backend implementation of MCP, including MCP server hosting, tool registration, versioning, and lifecycle management on Azure
- Implement and evolve A2A communication patterns to enable GRACE agents to interoperate with internal and external systems
- Design and maintain LLM orchestration, routing, and multi-model switching infrastructure across OpenAI GPT, Anthropic Claude, and Google Gemini families
- Build and operate RAG pipelines: document ingestion, chunking, embedding, and semantic search
- Implement robust fallback, retry, circuit-breaker, and graceful degradation patterns for all AI service dependencies
Manage observability and production quality:
- Build and maintain end-to-end observability for agentic workflows: latency, throughput, error rates, token usage, and LLM quality metrics
- Implement LLM evaluation pipelines including safety checks, regression monitoring, and grounding assessment
- Define and enforce system-level SLOs for availability, response time, and tool call reliability
- Manage alerting and on-call runbooks
Collaborate and foster teamwork:
- Establish and improve coding standards, design review processes, and testing practices
- Communicate technical decisions in writing and in conversation to both engineers and non-engineers
- Mentor and guide other engineers
- Think inventively and consider other perspectives; work backward from the user to understand problems before proposing solutions
- Ensure strict privacy, security, and compliance in all systems, integrations, and data handling

Requirements:

Bachelor's or Master's in Computer Science, Software Engineering, or related field, or equivalent practical experience
7+ years of professional software engineering experience building and operating production systems
Proven experience in high-velocity environments shipping complex products end-to-end
Strong proficiency in backend languages (including Python); familiarity with modern backend frameworks and async patterns
Solid understanding of distributed systems, APIs, data pipelines, and software design patterns
Hands-on experience on Microsoft Azure: Azure Functions, API Management, Container Apps, and Azure OpenAI Service
Experience with containerization, CI/CD, and infrastructure as code
Strong understanding of authentication and identity systems (OAuth2, OIDC, Azure Entra ID or equivalent)
Demonstrated experience/ability with production systems (having been on-call, debugged incidents, etc.)
Excellent communication and team-building skills; focused on making others around them better
Hands-on experience building and operating MCP servers in production, including tool registration, versioning, and hosting on Azure Functions or equivalent serverless infrastructure
Experience implementing A2A communication patterns and multi-agent orchestration frameworks
Significant experience building on top of LLMs in production (tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management)
Ability to demonstrate considerations for cost-per-query, context budgets, and prompt efficiency as first-class engineering concerns
Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning
Experience in security-conscious engineering within regulated or government environments
Previous track record in startup or early-stage environments (0-to-1 product building, comfort with ambiguity, and a high sense of urgency)
Experience in big tech building customer-facing platforms or developer infrastructure at scale
Familiarity with vector databases, embedding pipelines, and semantic search infrastructure
