Role Overview

Provide technical leadership for the cloud platforms, deployment systems, and operational foundations that power enterprise-scale generative AI applications.
Define and evolve the infrastructure architecture for AI/ML platforms running across AWS, Kubernetes, serverless, and containerized environments.
Lead platform standards for reliability, scalability, observability, CI/CD, security, and developer enablement, while partnering closely with software engineering, AI engineering, security, and operations teams.
Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services.
Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA.
Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments.
Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals.
Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics.
Build platform capabilities for GenAI workloads, including model availability monitoring.
Partner with software engineering teams to improve deployment reliability, rollback strategies, health checks, autoscaling, load testing, and runtime performance.
Define and enforce security and compliance practices for infrastructure, including IAM permission boundaries, Secrets Manager usage, secret scanning, audit logging, tagging standards, and change-management controls.
Provide technical leadership for cost optimization, capacity planning, environment standardization, and operational resilience across development, test, production, and sandbox environments.
Mentor engineers, review architecture and infrastructure designs, and influence platform engineering practices across teams.

Requirements

Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience.
7+ years of experience in DevOps, platform engineering, cloud infrastructure, site reliability engineering, or software engineering roles.
Strong hands-on experience with AWS/Azure/GCP infrastructure and services, including container, serverless, networking, storage, observability, and security services.
Experience designing and operating production systems on Kubernetes, ECS/Fargate, or comparable container orchestration platforms.
Proficiency with infrastructure-as-code, especially CloudFormation, Terraform, Helm, or similar tooling.
Strong CI/CD experience with GitHub Actions or similar platforms, including reusable workflows, automated testing, deployment gates, and cloud authentication.
Experience building and operating observability solutions using CloudWatch, Prometheus/Grafana, OpenSearch, or similar tools.
Strong understanding of cloud security practices, IAM, secrets management, least-privilege access, audit logging, and compliance requirements.
Experience supporting distributed systems, microservices, APIs, asynchronous workloads, and multi-environment deployments.
Demonstrated ability to lead technical design, mentor engineers, and influence engineering practices across teams.

Tech Stack

AWS
Azure
Cloud
Distributed Systems
DynamoDB
Google Cloud Platform
Grafana
Kubernetes
Microservices
Prometheus
Terraform

Benefits

Health care coverage
Retirement savings plans
Insurance benefits
Employee Assistance Program
Wellness benefits

Staff Platform Engineer, AI/ML Infrastructure
