Role Overview

Build new platform capabilities from architectural designs, translating security, governance, and infrastructure requirements into production-grade infrastructure-as-code.
Design and build the platform security and secrets management layer, ensuring all workloads operate with least-privilege credentials and certificates issued through a governed PKI hierarchy.
Implement and enforce security policy across the cluster using admission control, covering workload configuration, image standards, network traffic, and resource constraints.
Build the platform observability stack, providing consistent log aggregation, metrics, distributed tracing, and alerting across all platform components.
Design and implement GitOps delivery automation, ensuring all platform changes flow through version-controlled, auditable pipelines with drift reconciliation.
Build and configure workload autoscaling, ensuring AI workflow workers scale efficiently and cost-effectively in response to demand.
Implement the AI model routing and gateway layer, enabling governed, auditable routing of model traffic with per-consumer rate limiting.
Own the day-to-day operational health of the platform: monitor for issues, respond to incidents, conduct root-cause analysis, and implement lasting remediation.
Maintain the health of platform data services — database cluster, job queue, and object storage — including backup schedules, failover testing, and capacity management.
Monitor and tune autoscaling and resource configuration as workload patterns evolve, ensuring the platform scales responsively without over-provisioning.
Manage secrets rotation, certificate lifecycle, policy drift detection, and identity configuration as ongoing operational responsibilities.
Proactively identify and resolve technical debt — manual processes, undocumented configurations, legacy credential management, and gaps in observability coverage.

Requirements

Hands-on production delivery experience, not just conceptual familiarity.
Kubernetes — production cluster operation (RKE2, EKS, GKE, or equivalent); Helm, RBAC design, multi-namespace workload management.
Secrets management — production deployment of a secrets management platform (HashiCorp Vault or equivalent), covering PKI, dynamic credentials, and workload secrets injection.
Policy-as-code — admission control policy authoring and enforcement in production Kubernetes environments (OPA/Rego, Kyverno, or equivalent).
GitOps — Fleet, ArgoCD, Flux, or equivalent at production scale; declarative drift reconciliation, rollback strategy, multi-environment targeting.
Observability stack — log aggregation, log pipeline design, distributed tracing (OpenTelemetry or equivalent), and metrics dashboards (Prometheus/Grafana or equivalent).
API gateway engineering — production deployment and operation of an API or AI gateway (Kong, Envoy, or equivalent); rate limiting, plugin/policy authoring, route management.
Linux platform engineering — networking fundamentals, TLS and PKI, CSI storage operations, container runtime.

Tech Stack

Flux
Grafana
Kubernetes
Linux
Prometheus
Vault

Benefits

We empower you to be bold, driving your career to create the future you want.
We celebrate and reward your achievements.
SUSE is a dynamic, rapidly evolving environment that calls for agility, strong entrepreneurship, and an open mind.
A compelling opportunity for the right person to join us as we continue to scale and prosper.
Freedom to be yourself in a global community of unique individuals – like you – with different backgrounds, talents, skills and perspectives.
A truly open community where everyone is welcome, has a voice and is encouraged to reach their full potential regardless of age, gender, race, nationality, disability, sexual orientation, religion, or any other characteristics.

AI Solutions Engineer
