Role Overview

Advise on and help maintain large-scale computational and AI infrastructure, including monitoring, logging, and workload orchestration (Kubernetes and Linux job schedulers).
Provide consultative guidance and perform hands-on troubleshooting across the full stack, from bare metal and operating system through the software stack, container platform, networking, and storage.
Assess customer environments and recommend optimized, production-ready Kubernetes-based container platforms integrated with enterprise-grade networking and storage solutions.
Serve as a key technical resource: develop, refine, and document standard methodologies and operational guidelines to be shared with internal teams and customer partners.
Support Research & Development activities and engage in POCs/POVs to validate new features, architectures, and upgrade approaches.
Create and deliver high-quality documentation, including runbooks, onboarding materials, and best-practice guides for customers and internal teams.
Act as the technical leader for assigned customer accounts, providing strategic guidance on DevOps and platform architecture and influencing long-term infrastructure and operations decisions.

Requirements

BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related field (or equivalent experience), with 8+ years of professional experience managing scalable cloud environments and working in automation engineering roles.
Demonstrated understanding of networking fundamentals and data center architectures, plus hands-on experience managing HPC/AI clusters, including deployment, optimization, and troubleshooting.
Proven hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure, including driver management, CUDA toolkit integration, and GPU workload profiling.
Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with GPU-accelerated and HPC environments.
Strong familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks.
Deep knowledge of Linux (RedHat, Ubuntu), OS-level security, and protocols.
Experience with storage solutions such as Lustre, GPFS, ZFS, XFS, and emerging Kubernetes storage technologies.
Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools (e.g., Ansible, Terraform).
Experience with observability stacks (Grafana, Loki, Prometheus) for monitoring, logging, and building fault-tolerant systems.
Strong background in designing scalable solutions and providing consultative support to customers, including leading architectural reviews and presenting to executive stakeholders.

Tech Stack

Ansible
Cloud
Grafana
Kubernetes
Linux
Prometheus
Python
Terraform

Senior Solutions Architect, Cloud Infrastructure – DevOps
