Own and evolve the edge proxy platform: Maintain, upgrade, and extend a high-performance reverse proxy. This covers the proxy binary and its configuration tooling, Go and Python automation, the full container image lifecycle on hardened Linux base images, and the broader edge layer, including CDN, WAF, and traffic management capabilities.
Build and maintain cloud infrastructure as code: Design and implement Terraform/Terragrunt modules and live environment configurations managing EKS clusters, load balancers, IAM roles, VPC networking, ECR registries, and supporting AWS services across multiple regions including GovCloud.
Operate Kubernetes clusters at scale: Manage multi-region, multi-cluster EKS deployments via FluxCD GitOps workflows and Helm charts, including node AMI rotation, add-on lifecycle management, and horizontal pod autoscaling.
Build and own CI/CD pipelines: Design, maintain, and improve shared GitLab CI/CD pipeline templates used across all team repositories; build and operate alternative pipeline workflows for isolated government cloud environments.
Automate operational toil: Build and maintain tooling for tasks such as container image patching, EKS AMI rotation, air-gapped ECR image sync to GovCloud, and automated MR creation for monthly version-bump patching cycles.
Manage observability and on-call: Provision and maintain Datadog SLOs, monitors, and dashboards via Terraform; participate in the team's on-call rotation responding to edge proxy incidents across production and GovCloud environments.
Support FedRAMP/GovCloud operations: Operate the GovCloud environment with its unique constraints — air-gapped image distribution, infrastructure automation in isolated networks, and alert management with compliance-aware data handling.
Evaluate and adopt internal developer tooling: Research, prototype, and drive the adoption of internal tools that improve engineering productivity across the company — including developer portals, platform self-service capabilities, and other tooling that raises the bar for the developer experience at Smartsheet.
Mentor and collaborate: Share knowledge across the team through code reviews, architecture discussions, and runbook authorship; foster a culture of engineering excellence and operational rigour.
Strategically apply AI tools: Apply and champion AI tools within your team's domain to improve project execution, infrastructure design, quality, and debugging, and lead the adoption of AI best practices.
Requirements
5+ years of experience in DevOps, platform engineering, or site reliability engineering.
A BS or MS in Computer Science, Engineering, or a related field, or equivalent industry experience.
Deep proficiency with Terraform and Terragrunt for managing production cloud infrastructure at scale across multiple environments and regions.
Strong Kubernetes expertise, including EKS cluster operations and Helm chart authoring.
Hands-on experience with AWS networking and container workload services: EKS, ALB/NLB, VPC, IAM, ECR, Route 53, CloudWatch, and EventBridge.
Proficiency in at least one general-purpose programming language — Go or Python preferred — for building operational tooling and automation.
Solid understanding of reverse proxies, API gateways, or load balancers (NGINX, HAProxy, or equivalent).
Experience designing and maintaining CI/CD pipelines (GitLab CI preferred), including shared template libraries across multiple repositories.
Experience with container image security practices: hardened base images, vulnerability scanning, and image promotion workflows.
Strong operational instincts: comfort with on-call responsibilities, incident response, runbook authorship, and postmortems in production environments.
1+ years of professional experience leveraging AI-based workflows to author, maintain, review, and deploy infrastructure or code.
Fluency in English is required.
Legally eligible to work in Bulgaria on an ongoing basis.