Stellar Cyber is a fast-growing global leader in cybersecurity, trusted by major enterprises and government agencies. They are seeking a highly skilled Senior DevOps / Site Reliability Engineer to build, operate, and scale reliable cloud-native infrastructure and distributed data platforms, while driving automation and operational excellence.

Responsibilities:

Administer and maintain Kubernetes clusters and containerized workloads
Manage cloud infrastructure across OCI, AWS, GCP, or Azure environments
Develop and maintain CI/CD pipelines for reliable application deployments
Implement and manage Infrastructure as Code (IaC) using Terraform and Helm
Build automation tooling and operational workflows using Python, Go, or Bash
Drive observability initiatives including monitoring, logging, tracing, and alerting improvements
Monitor, troubleshoot, and resolve production incidents while participating in on-call rotations
Support and optimize distributed data platforms including Kafka, Elasticsearch, Spark, Redis, and MongoDB
Improve platform reliability, scalability, and operational efficiency using SRE best practices
Collaborate with cross-functional teams across multiple time zones
Perform Linux system administration and network troubleshooting
Contribute to incident response processes, postmortems, and reliability improvements
Support GitOps and deployment workflows using tools such as ArgoCD and GitHub Actions
Evaluate and implement AI-assisted operational tooling for auto-remediation, alert correlation, and operational intelligence

Requirements:

5+ years of experience in DevOps, SRE, or Platform Engineering roles
Strong expertise with Kubernetes, Docker, and container orchestration
Hands-on experience managing production cloud environments
Strong Infrastructure as Code experience with Terraform and Helm
Experience with CI/CD tools and deployment automation
Advanced troubleshooting skills in Linux systems, networking, and distributed systems
Experience with observability platforms including Prometheus, Grafana, Loki, Alertmanager, and Elastic Stack
Strong programming and scripting skills in Python, Bash, or Go
Experience supporting high-availability production systems and on-call operations
Knowledge of incident management and reliability engineering practices
Familiarity with data platform technologies such as Kafka, Spark, Elasticsearch, Redis, or MongoDB
Understanding of AI-driven operational tooling and automated remediation concepts
Excellent communication, collaboration, and problem-solving skills
Must reside on the East Coast

