Booz Allen Hamilton is seeking an AI Cloud Platform Site Reliability Engineer to ensure the availability and operational integrity of an AWS GovCloud-based agentic AI platform. The role involves collaborating with various teams to monitor and optimize AI operations, manage incidents, and automate processes to enhance system performance.
Responsibilities:
- Define, implement, and maintain service level indicators, service level objectives, error budgets, dashboards, alarms, and escalation paths for an agentic AI platform operating in AWS GovCloud
- Monitor end-to-end health and performance of agent workflows, model invocations, retrieval or knowledge integrations, orchestration steps, tool calls, and dependent services
- Triage incidents, alerts, and operational tickets. Lead root-cause analysis, coordinate recovery actions, and drive post-incident corrective actions that reduce mean time to recovery and prevent recurrence
- Build and maintain observability pipelines across metrics, logs, traces, audit telemetry, and operational events using AWS-native tooling and approved enterprise observability tooling
- Establish and tune operational thresholds for latency, availability, error rates, token and cost consumption, workflow success rates, tool failure rates, guardrail interventions, and drift-related signals
- Partner with platform engineers, cloud engineers, AI agent developers, MLOps engineers, data scientists, and customer SMEs to define ownership boundaries, handoffs, rollback criteria, release readiness gates, and operational support models
- Coordinate with MLOps and data science teams when model or data quality degradation, drift, or unexpected behavior requires rollback, retraining, prompt changes, knowledge-base updates, or other corrective actions
- Automate remediation and routine operational tasks using Python, shell scripting, infrastructure as code, and event-driven workflows to reduce manual toil
- Support secure and compliant operations in regulated national defense environments, including auditability, least-privilege access, controlled logging, and disciplined change management
- Work with limited direction, mentor junior team members, and help mature AI operations practices across the program
Requirements:
- 5+ years of experience supporting production distributed systems in a role such as SRE, Platform Engineering, Cloud Operations, or DevOps
- Experience operating workloads on AWS, including monitoring, alerting, logging, incident response, troubleshooting, IAM, networking, or secure operations
- Experience supporting production AI/ML, generative AI, RAG, agentic AI, model‑serving, or data‑driven decision systems
- Experience defining and operating SLIs, SLOs, error budgets, alert thresholds, runbooks, or operational readiness criteria
- Experience with observability tooling across metrics, logs, traces, dashboards, or log analytics, including CloudWatch, OpenTelemetry, Prometheus, Grafana, OpenSearch, or ELK
- Experience diagnosing issues across containers, orchestration platforms, or cloud runtimes, such as EKS, ECS, Lambda, or EC2
- Experience with Python, Bash, or scripting languages to automate operational tasks, health checks, or remediation workflows
- Experience participating in on‑call rotations, triaging ticket queues, and leading incident response or post‑incident review activities
- Secret clearance
- Bachelor's degree
- Experience with Amazon Bedrock, Bedrock Agents, Guardrails, Knowledge Bases, model invocation logging, EventBridge, CloudTrail, and CloudWatch‑based monitoring for AI workloads, or with equivalent tooling for production agentic AI systems
- Experience supporting AWS workloads in GovCloud, FedRAMP High, DoD SRG IL4/5, or other regulated or high‑assurance environments
- Experience with automation and infrastructure as code using Terraform, CloudFormation, or AWS CDK
- Experience with CI/CD release engineering, canary strategies, rollback controls, and change management for cloud services and AI‑enabled applications
- Experience with Prometheus‑compatible monitoring, Grafana, OpenSearch/ELK, or other enterprise observability stacks in containerized environments
- Experience supporting GPU‑backed inference, self‑hosted model serving, or hybrid AI deployments if the platform evolves beyond managed services
- Ability to distinguish infrastructure issues from AI‑specific failure modes, including workflow breakdowns, degraded retrieval, safety interventions, regressions, stale knowledge sources, and model or service throttling
- Experience working in Agile and cross‑functional environments and collaborating with engineers, operators, mission stakeholders, and technical leadership
- AWS Certified CloudOps Engineer - Associate, AWS Certified DevOps Engineer - Professional, AWS Certified Machine Learning Engineer - Associate, AWS Certified Generative AI Developer - Professional, AWS Certified Security - Specialty, or other cloud and AI operations certifications
- CompTIA Security+ or DoD 8570/8140 baseline certification