Role Overview

You'll set the technical direction — not just execute it.
From initial architecture through production deployment, you'll own the roadmap for Walmart's agentic AI platform for performance and resiliency.
You'll have the autonomy to make architectural tradeoffs, drive experimentation, and shape how intelligent systems operate at enterprise scale.
Architect production multi-agent pipelines — from RAG-based knowledge grounding to LLM-driven decision-making and autonomous remediation — operating across 10,500 stores and 240M weekly customers.
Own LLM evaluation standards for production: factuality, consistency, safety guardrails, and failure modes; set the bar that other teams adopt.
Optimize LLM inference at scale through prompt caching, quantization, and retrieval filtering — measurable latency and cost impact, not theoretical gains.
Integrate vector databases and observability stacks to build context-aware systems that act on live signals without human intervention.
Build the AI/ML layer that moves Walmart from reactive incident response to predictive, self-correcting infrastructure — cutting mean time to recovery across critical systems.
Define SLOs that reflect real business impact, integrate performance gates into CI/CD, and make observability (Grafana, Prometheus, ELK, Splunk) actionable across the org.
Write and maintain runbooks that teams actually use: tested, updated after every incident, and clear enough to act on under pressure.
Lead the architectural direction for the org's agentic AI platform — from initial design through production deployment — and own the decisions that follow.
Close the gap between experimentation and production: move ML models from notebooks into reliable, monitored systems that hold up under Black Friday-scale traffic.
Raise the technical floor through design reviews and mentoring that produces engineers who make better decisions independently.

Requirements

10+ years of experience building and operating distributed systems at scale
Proven, hands-on production experience with LLMs, agentic frameworks, or RAG-based systems
Deep background in performance engineering, chaos engineering, or SRE — with real ownership of SLOs and incident response
Strong programming skills in Python and/or Java; comfort working across the full ML stack
Bachelor’s degree in computer science, computer engineering, computer information systems, software engineering, or a related area plus 5 years’ experience in software engineering or a related area; or, in place of a degree, 7 years’ experience in software engineering or a related area.
Familiarity with ML frameworks: PyTorch, TensorFlow, Hugging Face Transformers
Hands-on with cloud-native infrastructure: GCP, Azure, Kubernetes, Docker
MLOps experience: CI/CD for ML, drift detection, model monitoring
Experimentation background: A/B testing, causal inference, multi-armed bandits
Excellent communication skills, with the ability to align technical and non-technical stakeholders on complex architectural decisions

Tech Stack

Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Java
Kubernetes
Prometheus
Python
PyTorch
Splunk
TensorFlow

Benefits

Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.

Principal Software Engineer
