Booz Allen Hamilton is seeking an AI Cloud Platform Site Reliability Engineer to ensure the availability and operational integrity of an AWS GovCloud-based agentic AI platform. The role involves collaborating with various teams to monitor and optimize AI operations, manage incidents, and automate processes to enhance system performance.

Responsibilities:

Define, implement, and maintain service level indicators, service level objectives, error budgets, dashboards, alarms, and escalation paths for an agentic AI platform operating in AWS GovCloud
Monitor end-to-end health and performance of agent workflows, model invocations, retrieval or knowledge integrations, orchestration steps, tool calls, and dependent services
Triage incidents, alerts, and operational tickets. Lead root-cause analysis, coordinate recovery actions, and drive post-incident corrective actions that reduce mean time to recovery and prevent recurrence
Build and maintain observability pipelines across metrics, logs, traces, audit telemetry, and operational events using AWS-native tooling and approved enterprise observability tooling
Establish and tune operational thresholds for latency, availability, error rates, token and cost consumption, workflow success rates, tool failure rates, guardrail interventions, and drift-related signals
Partner with platform engineers, cloud engineers, AI agent developers, MLOps engineers, data scientists, and customer SMEs to define ownership boundaries, handoffs, rollback criteria, release readiness gates, and operational support models
Coordinate with MLOps and data science teams when model or data quality degradation, drift, or unexpected behavior requires rollback, retraining, prompt changes, knowledge-base updates, or other corrective actions
Automate remediation and routine operational tasks using Python, shell scripting, infrastructure as code, and event-driven workflows to reduce manual toil
Support secure and compliant operations in regulated national defense environments, including auditability, least-privilege access, controlled logging, and disciplined change management
Work with limited direction, mentor junior team members, and help mature AI operations practices across the program

Requirements:

5+ years of experience supporting production distributed systems in roles such as SRE, Platform Engineering, Cloud Operations, or DevOps
Experience operating workloads on AWS including monitoring, alerting, logging, incident response, troubleshooting, IAM, networking, or secure operations
Experience supporting production AI/ML, generative AI, RAG, agentic AI, model‑serving, or data‑driven decision systems
Experience defining and operating SLIs, SLOs, error budgets, alert thresholds, runbooks, or operational readiness criteria
Experience with observability tooling across metrics, logs, traces, dashboards, or log analytics, including CloudWatch, OpenTelemetry, Prometheus, Grafana, OpenSearch, or ELK
Experience diagnosing issues across containers, orchestration platforms, or cloud runtimes, such as EKS, ECS, Lambda, or EC2
Experience with Python, Bash, or scripting languages to automate operational tasks, health checks, or remediation workflows
Experience participating in on‑call rotations, triaging ticket queues, and leading incident response or post‑incident review activities
Secret clearance
Bachelor's degree
Experience with Amazon Bedrock, Bedrock Agents, Guardrails, Knowledge Bases, model invocation logging, EventBridge, CloudTrail, and CloudWatch‑based monitoring for AI workloads or equivalent tooling for production agentic AI systems
Experience supporting AWS workloads in GovCloud, FedRAMP High, DoD SRG IL4/5, or other regulated or high‑assurance environments
Experience with automation and infrastructure as code using Terraform, CloudFormation, or AWS CDK
Experience with CI/CD release engineering, canary strategies, rollback controls, and change management for cloud services and AI‑enabled applications
Experience with Prometheus‑compatible monitoring, Grafana, OpenSearch/ELK, or other enterprise observability stacks in containerized environments
Experience supporting GPU‑backed inference, self‑hosted model serving, or hybrid AI deployments if the platform evolves beyond managed services
Ability to distinguish infrastructure issues from AI‑specific failure modes including workflow breakdowns, degraded retrieval, safety interventions, regressions, stale knowledge sources, and model or service throttling
Experience working in Agile and cross‑functional environments and collaborating with engineers, operators, mission stakeholders, and technical leadership
AWS Certified CloudOps Engineer - Associate, AWS Certified DevOps Engineer - Professional, AWS Certified Machine Learning Engineer - Associate, AWS Certified Generative AI Developer - Professional, AWS Certified Security - Specialty, or other cloud and AI operations certifications
CompTIA Security+ or DoD 8570/8140 baseline certification
