Network to Code is dedicated to pioneering network automation technologies. As a Site Reliability Engineer on the Nautobot Cloud Engineering team, you will operate, support, and evolve customer environments in AWS while ensuring uptime, performance, and security.
Responsibilities:
- Operate and support Nautobot Cloud deployments in AWS, including EKS, EC2, RDS, and associated services
- Use Jira to manage operational and project-related tasks, track incidents, and document changes
- Support resolution of escalated issues for customers running other Kubernetes-based environments, such as AKS or on-prem, as needed
- Deploy and update Nautobot instances using Helm charts, Kubernetes manifests, and automation workflows
- Automate and improve CI/CD pipelines and related tooling (GitHub Actions, Terraform, Ansible) for provisioning, upgrades, and configuration management
- Maintain observability tools (Prometheus, Loki, Grafana) to ensure accurate monitoring, alerting, and logging
- Troubleshoot application and infrastructure issues across containerized environments
- Collaborate with engineers across Cloud Operations, Nautobot Core, and Nautobot Apps teams to deliver cross-functional solutions
- Contribute to documentation, including operational runbooks, troubleshooting guides, and architecture diagrams
- Participate in Agile ceremonies, including standups and retrospectives
Requirements:
- 3–5 years of experience applying DevOps or SRE practices to production systems
- 2+ years of experience operating workloads in AWS, with a focus on EKS, EC2, IAM, and networking
- 2+ years of experience working with Kubernetes (preferably in production) and Helm
- Experience with IaC tools such as Terraform and configuration management tools like Ansible
- Familiarity with CI/CD pipelines (GitHub Actions, Jenkins, CircleCI, etc.)
- Proficiency in scripting languages such as Python or Bash
- Comfortable working in Linux-based environments
- Familiarity with monitoring, logging, and alerting solutions (Prometheus, Loki, Grafana, Datadog, ELK)
- Skilled in using Jira to manage operational tasks, incident response, sprint planning, and project tracking
- Strong analytical and troubleshooting skills, including using k9s for real-time Kubernetes management and Terraform to diagnose and resolve Infrastructure-as-Code deployment issues
- Networking fundamentals (equivalent to CCNA-level understanding)
- Passion for reliability, customer success, and operational excellence
- Ability to troubleshoot complex distributed systems and quickly identify root causes
- Strong communication skills, with the ability to clearly convey technical concepts to both peers and customers
- A proactive mindset, looking for opportunities to improve processes and prevent issues before they occur
- Flexibility to adapt to changing priorities and technologies