Design and implement cloud-native infrastructure supporting Agentforce and the broader Salesforce AI Platform, including agent orchestration services, model serving APIs, and LLM gateway components.
Drive the platform's AIOps capabilities, including observability integrations for noise reduction, anomaly detection, and proactive incident detection across AI services.
Architect and build AI agents and intelligent pipelines that automate operational workflows, including log ingestion, real-time signal analysis, trend detection, and the surfacing of actionable insights for engineering teams.
Lead infrastructure migrations (cluster upgrades, platform unification, service onboarding) with a focus on safety, automation, and zero disruption to production AI services.
Partner with AI/ML engineers and data scientists to ensure the platform supports model training, inference, fine-tuning, and agent orchestration workloads efficiently.
Drive cost optimization across compute-heavy AI workloads (GPU instances, node pools, inference endpoints).
Requirements
8+ years of software development or infrastructure engineering experience.
Deep expertise with Kubernetes, including managing, upgrading, and operating a fleet of services in production environments.
Experience building and deploying infrastructure for AI/ML or LLM workloads (model serving, inference APIs, agent orchestration services).
Proficiency with AWS (EKS, SageMaker, S3, EC2, IAM), including writing custom IAM policies and managing cross-account access patterns.
Hands-on experience with AIOps or observability tooling, including alert management, log analysis pipelines, and APM platforms.
Experience with infrastructure-as-code and CI/CD deployment tooling.
Strong grasp of DevOps/SRE principles, including runbooks, alerting, on-call readiness, and operational documentation.
Proficiency in Python and/or Go; comfortable with Bash for automation.