Future Secure AI is a pioneering company focused on building innovative AI solutions to tackle real-world problems for global enterprises. They are seeking a Site Reliability Engineer to design, build, and operate the platforms that support AI Co-Workers, ensure system reliability, and collaborate closely with engineering teams.
Responsibilities:
- Design, build, and operate reliable production infrastructure supporting AI Co‑Workers
- Own Kubernetes‑based platforms used to deploy and run AI workloads
- Build and maintain infrastructure as code using Terraform
- Implement and maintain Helm‑based deployment workflows
- Define, measure, and improve system reliability using SLIs, SLOs, and SLAs
- Participate in on‑call rotation, incident response, root cause analysis, and post‑mortems
- Reduce operational toil through automation and engineering improvements
- Build and improve observability across monitoring, logging, and alerting
- Partner closely with engineers to ensure systems are resilient, scalable, and secure
- Work across the build, deploy, and operate phases of the software lifecycle
Requirements:
- Hands‑on Kubernetes experience designing, building, or operating workloads on EKS, AKS, GKE, or self‑managed Kubernetes
- Hands‑on Terraform experience for infrastructure provisioning and automation
- Hands‑on Helm experience for Kubernetes application deployment
- Professional experience using at least two programming or scripting languages such as Python, Go, Java, Bash, PowerShell, or Ruby
- Direct experience as a Site Reliability Engineer or in an equivalent role, including reliability engineering, on‑call, incident response, post‑mortems, and toil reduction
- Experience working within a defined SDLC, including CI/CD, release processes, and end‑to‑end delivery from design to operations
- Hands‑on experience with at least one major cloud provider such as AWS, Azure, or Google Cloud
- Experience with ArgoCD or GitOps‑style deployment approaches
- Five or more years of relevant professional experience
- DevOps or DevSecOps experience, including CI/CD ownership, infrastructure automation, and security considerations
- Relevant certifications such as CKA or CKAD, or cloud, DevOps, DevSecOps, or programming credentials