Aerospike is the real-time database for mission-critical workloads, including machine learning and AI. The company is seeking a Senior Software Engineer to join its Cloud team and to design and build the infrastructure orchestration and operational systems for Aerospike Cloud, work that directly affects the reliability and scalability of production database clusters.
Responsibilities:
- Design, implement, and maintain components and workloads responsible for provisioning and managing Kubernetes-based Aerospike clusters
- Build and evolve workflows used to orchestrate long-running infrastructure and database lifecycle operations
- Develop Kubernetes-native systems using controllers, operators, and CRDs
- Own cloud infrastructure automation using Terraform, including VPCs, EKS clusters, IAM, storage, and networking
- Design and maintain persistent storage lifecycles involving EBS volumes, local NVMe instance storage, and backups
- Diagnose and resolve complex production issues spanning Kubernetes, cloud provider APIs, and distributed systems
- Improve observability through metrics, logs, and alerts using Prometheus, OpenTelemetry, and Datadog
- Collaborate across teams to evolve architecture, improve reliability, and prevent operational regressions
Requirements:
- At least 5 years of relevant software engineering experience
- Strong foundation in computer science, distributed systems, and debugging complex systems
- Proficiency in at least one statically typed backend language (preferably Go)
- Experience developing and operating distributed systems in production
- Hands-on experience with Kubernetes and containerized workloads
- Experience with at least one major cloud provider (AWS preferred)
- Experience designing, deploying, and operating stateful systems
- Familiarity with Git-based workflows and CI/CD pipelines
Preferred qualifications:
- Proficiency in Go
- Experience with Terraform and infrastructure-as-code
- Experience with Kubernetes Operators, controllers, or CRDs
- Experience with workflow orchestration systems (Temporal, Cadence, Airflow, etc.)
- Strong understanding of cloud networking concepts (VPCs, subnets, IP management, load balancers)
- Experience with observability stacks (Prometheus, OpenTelemetry, Datadog)
- Experience operating storage systems (EBS, instance store, backups)