Cisco is at the forefront of integrating artificial intelligence into its platforms, transforming collaboration, security, networking, and more. The company is seeking an AI Infrastructure Site Reliability Engineer to apply SRE practices, automate operational capabilities, and ensure the availability and efficiency of its AI platforms. The role involves working with top AI experts to contribute to ethical AI products and solutions.
Responsibilities:
- Leverage SRE practices to reduce toil and maintain Service Level Objectives (SLOs) for internal AI platforms
- Lead, build, and run fully automated CI/CD pipelines for operational excellence and continuous improvement
- Ensure the availability, scalability, latency, and efficiency of NVIDIA DGX and Cisco UCS infrastructure using fault-tolerant engineering approaches
- Drive capacity planning, performance analysis, instrumentation, and other non-functional requirements
- Automate operational capabilities using Python, Ansible, Terraform, Go, and related technologies
- Deliver automation through CI/CD pipelines and chatbot integrations
- Implement metrics-driven processes to maintain high service quality
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of IT experience
- 5+ years of experience deploying and administering NVIDIA DGX or equivalent high-performance computing (HPC) clusters (e.g., Cray, HPE, IBM)
- 5+ years of experience coordinating and supporting Linux-based operating systems
- 5+ years of programming experience in languages such as Python, Go, or C/C++; experience with Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
- 5+ years of experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos
- Advanced knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux
- 5+ years of experience with the software development lifecycle: design, development, testing, packaging, and deployment (preferably using Python or Go)
Preferred qualifications:
- Master's degree or equivalent experience in a relevant field
- Certifications in Linux, networking, cloud, or related technologies
- Previous experience as a compute or site/systems reliability engineer
- Experience with hybrid cloud, virtualization, and container technologies
- Familiarity with Agile and DevOps operating models, including project tracking tools (e.g., Jira, Rally)
- Excellent collaboration, leadership, and communication skills