Drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems.
Design, write, and build tools to improve the reliability, latency, availability, and scalability of the Walmart tech stack.
Define and track metrics and measurements to ensure reliability and availability.
Enable scaling by providing tools, developing training and/or augmenting processes.
Build tools and automation to prevent recurrence of problems in mission-critical products and services.
Augment existing instrumentation to build a cohesive picture of the characteristics of systems with special attention to points of failure.
Drive the team to build and scale fault-tolerant systems and services in hybrid cloud infrastructure.
Partner with leadership across the organization to establish strategic plans and objectives to improve mean time to detect and mean time to restore.
Requirements
Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or a related area and 6 years' experience in software engineering or a related area, or 8 years' experience in software engineering or a related area.
3 years' supervisory experience.
Expert-level AI/ML engineering experience with deep expertise in machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML system deployment at scale.
Advanced experience with agentic AI systems including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
Comprehensive Site Reliability Engineering expertise, including hands-on experience with Service Management (Incident, Problem, and Change Management) and Performance and Capacity Engineering for AI/ML systems.
Expert-level cloud engineering experience (Azure, GCP, AWS) with deep knowledge of cloud-native AI/ML services, containerization (Kubernetes, Docker), and serverless architectures.
Deep observability and monitoring expertise with hands-on experience in: distributed tracing (Jaeger, Zipkin, OpenTelemetry) for AI/ML pipelines; metrics collection and alerting (Prometheus, Grafana, Datadog) with ML-specific dashboards; log aggregation and analysis (ELK stack, Splunk, Fluentd) for model and system monitoring; APM tools and performance monitoring for AI/ML workloads; and AI-driven anomaly detection and predictive monitoring systems.
Platform Engineering experience including: building developer platforms and internal tooling for AI/ML teams; Infrastructure as Code (Terraform, CloudFormation, Pulumi); service mesh architectures (Istio, Linkerd) for AI/ML services; API gateway and microservices platform development; and self-service ML deployment platforms and developer productivity tools.
Experience in large-scale retail, e-commerce, or high-traffic consumer-facing systems with strict availability and performance requirements (strongly preferred).
Tech Stack
AWS
Azure
Cloud
Docker
Google Cloud Platform
Grafana
Kubernetes
Microservices
Prometheus
PyTorch
Splunk
TensorFlow
Terraform
Benefits
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase, and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.