SPECTRAFOR is seeking a Principal Site Reliability Engineer to ensure the availability, reliability, and performance of clients' Demo Platform AI services. The role involves managing complex cloud systems, implementing Model as a Service infrastructure, and collaborating with various teams to align requirements with functional capabilities.
Responsibilities:
- Design, develop, and implement robust, scalable, and secure IT infrastructure solutions, aligned with business objectives and industry best practices
- Implement automation and DevOps processes to improve the cloud life cycle, including infrastructure and application uptime, availability, right-sizing and time-to-market
- Collaborate with teammates and project stakeholders to meet timelines, goals, and SLAs
- Design and implement a Model as a Service platform using Red Hat AI products and GPU-enabled Intel hardware systems
- Perform architectural planning, deployment, and management of OpenShift Container Platform environments
- Architect and optimize virtualization solutions using KVM/QEMU, including advanced capabilities offered by OpenShift Virtualization (KubeVirt)
- Design and implement advanced network architectures, particularly Software-Defined Networking (SDN) and Open Virtual Network (OVN), ensuring high performance and reliability
- Develop comprehensive storage strategies, including the design and administration of physical storage solutions and distributed storage systems like Ceph / OpenShift Data Foundation (ODF)
- Oversee the administration and automation of bare-metal infrastructure, ensuring optimal performance and resource utilization
- Drive automation initiatives using Ansible and Red Hat Advanced Cluster Management for Kubernetes (ACM) for infrastructure provisioning, configuration management, and operational tasks
- Establish and optimize CI/CD pipelines for infrastructure and platform deployments, promoting agile and efficient delivery
- Provide technical leadership, mentorship, and guidance to engineering teams on architectural patterns and best practices
- Evaluate new technologies and trends, recommending solutions that enhance our IT landscape and provide competitive advantages
- Collaborate cross-functionally with development, operations, and business teams to gather requirements and translate them into architectural designs
- Create and maintain detailed architectural documentation, including design specifications, diagrams, and operational guides
- Contribute to performance testing and tuning, quality assurance (QA), and ticket and incident management
Requirements:
- 8+ years of progressive experience in IT architecture, with a significant focus on infrastructure design and implementation
- 5+ years of experience with public cloud, virtualization, and Linux technologies, specifically KVM/QEMU, and a strong understanding of OpenShift Virtualization (KubeVirt)
- 5+ years of experience with Red Hat OpenShift Container Platform or Kubernetes, including cluster operations, networking, storage integration, and security
- 3+ years of experience with automation frameworks and tools like Ansible or Terraform
- Hands-on experience with bare-metal administration, including hardware provisioning, firmware management, and operating system deployment
- Solid understanding and practical experience with CI/CD methodologies and tools for automated deployments
- Strong problem-solving abilities, analytical skills, and a strategic mindset
- Excellent communication, presentation, and interpersonal skills, capable of articulating complex technical concepts to diverse audiences
- Experience with AI/ML technologies and recent developments, including OpenShift AI, inference systems, and technologies like vLLM
- Proven experience with enterprise-grade storage solutions and software-defined storage technologies, especially Ceph and ODF
- Advanced knowledge of Software-Defined Networking (SDN) principles and practical experience with Open Virtual Network (OVN)
- Extensive experience with Red Hat Enterprise Linux (RHEL) administration and design
- Experience with AppDev automation and pipelines, including technologies like Jenkins, Tekton, and Argo CD
- Experience with networking technologies, including VLANs, routing protocols, IPAM solutions, and load balancers
- Experience in designing and delivering implementations using various public and private cloud infrastructure technologies and providers
- Experience in Python development for automation, scripting, and tool development
- Experience in Go development for building high-performance applications or infrastructure components
- Relevant certifications (e.g., Red Hat Certified Architect, Kubernetes certifications, industry cloud certifications)