Maintain, optimize, and scale our hybrid Kubernetes (RKE2) edge clusters to ensure maximum uptime, reliable local data extraction, and seamless expansion across customer facilities.
Remediate and manage container security and base image vulnerabilities to harden our edge-to-cloud infrastructure against security threats and maintain strict compliance standards.
Define, implement, and monitor pipeline Service Level Indicators and Service Level Objectives within Splunk to proactively identify bottlenecks and guarantee the health, freshness, and availability of our hybrid data streams.
Integrate and automate testing, security scanning, and deployment stages into GitHub Actions to increase deployment velocity, eliminate manual errors, and shorten feedback loops.
Architect, build, and deploy robust AWS CloudFormation stacks to provide highly available, repeatable, and scalable cloud database and data ingestion infrastructure.
Lead and execute the technical configuration and validation of on-premises infrastructure and logical replication pipelines to shorten the onboarding timeline and ensure zero data loss when standing up new customer facilities.
Requirements
5+ years of experience in a DevOps, SRE, or Systems Engineering role managing production-grade infrastructure.
Strong experience managing Kubernetes environments (bonus points for edge/on-prem distributions like RKE2 or Rancher) and configuration management using Kustomize.
Deep familiarity with AWS services, specifically infrastructure provisioning via CloudFormation and managing data/compute resources (EC2, Lambda, RDS, DynamoDB, Firehose).
Hands-on experience configuring log aggregation and metrics dashboards using Splunk, CloudWatch, or similar tools.
Practical experience with container security scanning, patching, and maintaining a secure software supply chain.
Proficiency in Python, Node.js, or Ansible for writing automation scripts and troubleshooting pipeline issues.
Understanding of networking concepts required to support hybrid environments (e.g., local APIs connecting to OT systems, data replication to the cloud).
Tech Stack
Ansible
AWS
Cloud
DynamoDB
EC2
Kubernetes
Node.js
Python
Splunk
Benefits
Equity
Medical, Life, Short-Term Disability, and AD&D insurance