Stellar Cyber is a fast-growing global leader in cybersecurity, trusted by major enterprises and government agencies. They are seeking a highly skilled Senior DevOps / Site Reliability Engineer to build, operate, and scale reliable cloud-native infrastructure and distributed data platforms, while driving automation and operational excellence.
Responsibilities:
- Administer and maintain Kubernetes clusters and containerized workloads
- Manage cloud infrastructure across OCI, AWS, GCP, or Azure environments
- Develop and maintain CI/CD pipelines for reliable application deployments
- Implement and manage Infrastructure as Code (IaC) using Terraform and Helm
- Build automation tooling and operational workflows using Python, Go, or Bash
- Drive observability initiatives including monitoring, logging, tracing, and alerting improvements
- Monitor, troubleshoot, and resolve production incidents while participating in on-call rotations
- Support and optimize distributed data platforms including Kafka, Elasticsearch, Spark, Redis, and MongoDB
- Improve platform reliability, scalability, and operational efficiency using SRE best practices
- Collaborate with cross-functional teams across multiple time zones
- Perform Linux system administration and networking troubleshooting
- Contribute to incident response processes, postmortems, and reliability improvements
- Support GitOps and deployment workflows using tools such as ArgoCD and GitHub Actions
- Evaluate and implement AI-assisted operational tooling for auto-remediation, alert correlation, and operational intelligence
Requirements:
- 5+ years of experience in DevOps, SRE, or Platform Engineering roles
- Strong expertise in container orchestration with Kubernetes and Docker
- Hands-on experience managing production cloud environments
- Strong Infrastructure as Code experience with Terraform and Helm
- Experience with CI/CD tools and deployment automation
- Advanced troubleshooting skills in Linux systems, networking, and distributed systems
- Experience with observability platforms including Prometheus, Grafana, Loki, Alertmanager, and Elastic Stack
- Strong programming and scripting skills in Python, Bash, or Go
- Experience supporting high-availability production systems and on-call operations
- Knowledge of incident management and reliability engineering practices
- Familiarity with data platform technologies such as Kafka, Spark, Elasticsearch, Redis, or MongoDB
- Understanding of AI-driven operational tooling and automated remediation concepts
- Excellent communication, collaboration, and problem-solving skills
- Must reside on the East Coast