Ensure production-grade reliability, accuracy, and performance of our AWS-based agentic AI ecosystem
Lead investigations of complex agent/AI workflow failures using logs, metrics, and traces
Improve the quality and performance of Retrieval-Augmented Generation (RAG) and agent workflows
Establish and oversee evaluation approaches for models, RAG, and agents
Partner with InfoSec/AppSec to review architectures and ensure designs follow enterprise security patterns
Work with Governance teams to implement and monitor guardrails and controls across the AI platform
Drive 'Design for Reliability' patterns across both Platform and Agent Building teams
Translate reliability risks, performance trends, and operational metrics into clear business language for senior leaders, risk teams, and product owners
Coach DevLeads and architects on debugging agent behaviors, strengthening observability pipelines, improving orchestration, and hardening production deployments
Requirements
Bachelor's degree in Computer Science, Engineering, Information Systems, or related field (or equivalent experience)
10–14 years of IT experience (or 12–16 years in lieu of a degree), including meaningful roles in application development, platform engineering, SRE/operations, and/or architecture
Strong experience operating and improving reliability of cloud-native systems (AWS preferred; comparable cloud experience acceptable)
Experience supporting AI/ML systems is beneficial, but not mandatory if you demonstrate strong troubleshooting ability
Strong ability to script/build tooling in Python (or similar language) for reliability automation, analysis, testing, and operational workflows
Hands-on experience with observability practices and tools (CloudWatch/X-Ray/Splunk/New Relic or similar)
Experience with Infrastructure-as-Code (Terraform preferred; similar tools acceptable)
Working knowledge of identity and security patterns (OAuth2, SSO/federation, IAM roles/policies/SCP concepts)
Proven ability to lead through influence, drive standards/guardrails, and align multiple agile teams in a matrixed environment
Tech Stack
AWS
Cloud
Python
Ray
Splunk
Terraform
Benefits
Best-in-class employee benefits and programs that support work-life integration and overall well-being
Career advancement and upskilling opportunities, with a focus on Advancing Diverse Talent to take up leadership roles