Vida Global is a publicly traded, early-stage AI company building an AI Agent Operating System that helps businesses manage AI workforces. They are seeking a DevOps Engineer to design, build, automate, and operate cloud systems that support their AI platform, ensuring reliability, security, and scalability.

Responsibilities:

Design and operate production AWS environments across VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager
Build secure multi-VPC networking patterns with hub-and-spoke architecture and route domain isolation
Operate production EKS clusters, including platform add-ons, autoscaling, RBAC, namespace boundaries, and upgrade workflows
Manage infrastructure as code with Terraform, Terraform Cloud, reusable modules, and policy-controlled CI/CD authentication
Build observability systems with Prometheus, Grafana, Alertmanager, kube-state-metrics, Node Exporter, and centralized monitoring patterns
Implement Kubernetes platform services through Helm, including load balancing, autoscaling, external secrets, metrics, and monitoring
Improve platform security through private access patterns, least-privilege IAM, workload identity, SSM-only access, mTLS VPN, encrypted storage, and secure secret delivery
Support AI/LLM infrastructure such as LiteLLM, model gateway telemetry, traffic control, autoscaling, and service-level monitoring
Architect, operate, and improve AWS environments supporting Vida’s production platform
Design secure networking patterns across VPCs, EKS clusters, private services, Transit Gateway, ELB/NLB, Route 53, and Global Accelerator
Implement least-privilege IAM, workload identity through IRSA/OIDC, IMDSv2 enforcement, secure access controls, and production-ready AWS security patterns
Use ACM, SSM, ECR, GuardDuty, Secrets Manager, and related AWS services to improve security, reliability, and operational efficiency
Improve cross-region latency, traffic routing, and availability through Anycast, geo-routing, and resilient ingress patterns
Build and maintain production infrastructure using Terraform, HCL, Terraform Cloud remote state, policy controls, and OIDC-based CI authentication
Create and standardize reusable Terraform modules for networking, EKS, observability, security, and platform services
Build reproducible AMIs with Packer, including hardened Docker and runtime dependencies
Automate post-provisioning and configuration management with Ansible, dynamic EC2 inventory, SSM, Jinja2 templates, and multi-play orchestration
Create deterministic Terraform-to-Ansible handoff patterns for infrastructure provisioning and application bootstrap
Automate Helm lifecycle operations and environment-specific configuration rollout
Deploy, operate, and upgrade Amazon EKS clusters using managed node groups and production-safe workflows
Manage core EKS add-ons including VPC CNI, CoreDNS, kube-proxy, pod identity agent, metrics-server, external-secrets, cluster-autoscaler, aws-load-balancer-controller, and kube-prometheus-stack
Implement autoscaling, RBAC boundaries, namespace segmentation, and Kubernetes CRDs such as ServiceMonitor, PodMonitor, ExternalSecret, and ClusterSecretStore
Implement and operate Istio service mesh components, including base, control plane, ingress gateway, sidecar injection, and VirtualService routing policies
Integrate Istio ingress with internal NLB patterns for private service exposure and service-to-service traffic governance
Build and improve observability using Prometheus Operator, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting standards
Design federated monitoring patterns from in-cluster EKS workloads to centralized hub Prometheus infrastructure
Integrate AI platform telemetry, including LiteLLM Prometheus callbacks, ServiceMonitor, model traffic visibility, latency tracking, and runtime anomaly detection
Establish dashboards and alerts for cluster health, workload SLOs, infrastructure saturation, runtime errors, and service reliability
Deploy and operate LiteLLM on Kubernetes using multi-replica architecture, HPA, Istio ingress, secure secret integration, and service-level monitoring
Build scalable AI gateway patterns that support secure model access, traffic control, observability, and high availability
Implement zero-trust infrastructure patterns including private EKS API endpoints, SSM-only operator access, and no public bastion exposure
Enforce mTLS on Client VPN with ACM-issued certificates, and encrypt gp3 EBS volumes by default
Integrate AWS Secrets Manager with External Secrets Operator for secure Kubernetes secret delivery
Reduce operational risk through identity controls, segmentation, policy-driven access, secure defaults, and automated guardrails
Partner with engineering to improve incident readiness, debugging workflows, runtime visibility, and production change safety

Requirements:

5+ years of experience in DevOps, infrastructure engineering, platform engineering, site reliability engineering, cloud engineering, or a related role
Deep hands-on experience with AWS production environments, especially VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager
Strong experience with Terraform, Terraform Cloud, reusable modules, remote state, CI/CD authentication, and policy-controlled infrastructure workflows
Strong Kubernetes and EKS experience, including managed node groups, core add-ons, Helm, RBAC, autoscaling, namespaces, and production upgrades
Experience operating observability stacks with Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting
Experience implementing security-first cloud patterns, including least-privilege IAM, IRSA/OIDC, IMDSv2, private access, encrypted storage, and secure secret management
Comfort working with service mesh concepts and technologies such as Istio, ingress gateways, VirtualService routing, sidecar injection, and internal load balancer patterns
Strong automation skills using HCL, YAML, Jinja2, Bash, Python, and infrastructure scripting
Ability to troubleshoot complex distributed systems across cloud infrastructure, Kubernetes, networking, observability, and application runtime layers
High ownership, strong judgment, and a bias toward automation, documentation, reliability, and secure defaults
Comfort working in an early-stage company environment where priorities can change quickly and infrastructure work has direct customer impact
Experience building or operating AI/LLM infrastructure, model gateways, LiteLLM, or similar AI platform services
Experience with federated Prometheus architectures or centralized observability across multiple clusters and environments
Experience with Packer-based AMI pipelines and hardened Linux images, especially Ubuntu ARM64
Experience with hub-and-spoke AWS network architecture, Transit Gateway route domain isolation, Global Accelerator, Anycast, or geo-routing
Experience with External Secrets Operator, AWS Secrets Manager, ClusterSecretStore, and Kubernetes-native secret delivery
Experience supporting voice, communications, automation, telephony, reseller, or multi-tenant SaaS platforms
Experience designing production readiness standards, SLOs, incident response practices, and operational runbooks
Interest in AI agents, computer-use automation, and the infrastructure required to operate AI workforces at scale
