Vida Global is a publicly traded, early-stage AI company building an AI Agent Operating System that helps businesses manage AI workforces. The company is seeking a DevOps Engineer to design, build, automate, and operate the cloud systems that support its AI platform, with a focus on reliability, security, and scalability.
Responsibilities:
- Design and operate production AWS environments across VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager
- Build secure multi-VPC networking patterns with hub-and-spoke architecture and route domain isolation
- Operate production EKS clusters, including platform add-ons, autoscaling, RBAC, namespace boundaries, and upgrade workflows
- Manage infrastructure as code with Terraform, Terraform Cloud, reusable modules, and policy-controlled CI/CD authentication
- Build observability systems with Prometheus, Grafana, Alertmanager, kube-state-metrics, Node Exporter, and centralized monitoring patterns
- Implement Kubernetes platform services through Helm, including load balancing, autoscaling, external secrets, metrics, and monitoring
- Improve platform security through private access patterns, least-privilege IAM, workload identity, SSM-only access, mTLS VPN, encrypted storage, and secure secret delivery
- Support AI/LLM infrastructure such as LiteLLM, model gateway telemetry, traffic control, autoscaling, and service-level monitoring
- Architect, operate, and improve AWS environments supporting Vida’s production platform
- Design secure networking patterns across VPCs, EKS clusters, private services, Transit Gateway, ELB/NLB, Route 53, and Global Accelerator
- Implement least-privilege IAM, workload identity through IRSA/OIDC, IMDSv2 enforcement, secure access controls, and production-ready AWS security patterns
- Use ACM, SSM, ECR, GuardDuty, Secrets Manager, and related AWS services to improve security, reliability, and operational efficiency
- Improve cross-region latency, traffic routing, and availability through Anycast, geo-routing, and resilient ingress patterns
- Build and maintain production infrastructure using Terraform, HCL, Terraform Cloud remote state, policy controls, and OIDC-based CI authentication
- Create and standardize reusable Terraform modules for networking, EKS, observability, security, and platform services
- Build reproducible AMIs with Packer, including hardened Docker and runtime dependencies
- Automate post-provisioning and configuration management with Ansible, dynamic EC2 inventory, SSM, Jinja2 templates, and multi-play orchestration
- Create deterministic Terraform-to-Ansible handoff patterns for infrastructure provisioning and application bootstrap
- Automate Helm lifecycle operations and environment-specific configuration rollout
- Deploy, operate, and upgrade Amazon EKS clusters using managed node groups and production-safe workflows
- Manage core EKS add-ons including VPC CNI, CoreDNS, kube-proxy, pod identity agent, metrics-server, external-secrets, cluster-autoscaler, aws-load-balancer-controller, and kube-prometheus-stack
- Implement autoscaling, RBAC boundaries, namespace segmentation, and Kubernetes CRDs such as ServiceMonitor, PodMonitor, ExternalSecret, and ClusterSecretStore
- Implement and operate Istio service mesh components, including base, control plane, ingress gateway, sidecar injection, and VirtualService routing policies
- Integrate Istio ingress with internal NLB patterns for private service exposure and service-to-service traffic governance
- Build and improve observability using Prometheus Operator, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting standards
- Design federated monitoring patterns from in-cluster EKS workloads to centralized hub Prometheus infrastructure
- Integrate AI platform telemetry, including LiteLLM Prometheus callbacks, ServiceMonitor, model traffic visibility, latency tracking, and runtime anomaly detection
- Establish dashboards and alerts for cluster health, workload SLOs, infrastructure saturation, runtime errors, and service reliability
- Deploy and operate LiteLLM on Kubernetes using multi-replica architecture, HPA, Istio ingress, secure secret integration, and service-level monitoring
- Build scalable AI gateway patterns that support secure model access, traffic control, observability, and high availability
- Implement zero-trust infrastructure patterns including private EKS API endpoints, SSM-only operator access, and no public bastion exposure
- Enforce mTLS on Client VPN with ACM-issued certificates, and encrypt gp3 EBS volumes by default
- Integrate AWS Secrets Manager with External Secrets Operator for secure Kubernetes secret delivery
- Reduce operational risk through identity controls, segmentation, policy-driven access, secure defaults, and automated guardrails
- Partner with engineering to improve incident readiness, debugging workflows, runtime visibility, and production change safety
Requirements:
- 5+ years of experience in DevOps, infrastructure engineering, platform engineering, site reliability engineering, cloud engineering, or a related role
- Deep hands-on experience with AWS production environments, especially VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager
- Strong experience with Terraform, Terraform Cloud, reusable modules, remote state, CI/CD authentication, and policy-controlled infrastructure workflows
- Strong Kubernetes and EKS experience, including managed node groups, core add-ons, Helm, RBAC, autoscaling, namespaces, and production upgrades
- Experience operating observability stacks with Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting
- Experience implementing security-first cloud patterns, including least-privilege IAM, IRSA/OIDC, IMDSv2, private access, encrypted storage, and secure secret management
- Comfort working with service mesh concepts and technologies such as Istio, ingress gateways, VirtualService routing, sidecar injection, and internal load balancer patterns
- Strong automation skills using HCL, YAML, Jinja2, Bash, Python, and infrastructure scripting
- Ability to troubleshoot complex distributed systems across cloud infrastructure, Kubernetes, networking, observability, and application runtime layers
- High ownership, strong judgment, and a bias toward automation, documentation, reliability, and secure defaults
- Comfort working in an early-stage company environment where priorities can change quickly and infrastructure work has direct customer impact
- Experience building or operating AI/LLM infrastructure, model gateways, LiteLLM, or similar AI platform services
- Experience with federated Prometheus architectures or centralized observability across multiple clusters and environments
- Experience with Packer-based AMI pipelines and hardened Linux images, especially Ubuntu ARM64
- Experience with hub-and-spoke AWS network architecture, Transit Gateway route domain isolation, Global Accelerator, Anycast, or geo-routing
- Experience with External Secrets Operator, AWS Secrets Manager, ClusterSecretStore, and Kubernetes-native secret delivery
- Experience supporting voice, communications, automation, telephony, reseller, or multi-tenant SaaS platforms
- Experience designing production readiness standards, SLOs, incident response practices, and operational runbooks
- Interest in AI agents, computer-use automation, and the infrastructure required to operate AI workforces at scale