Kentik is the network intelligence platform for modern infrastructure teams. They are seeking a Staff-level Site Reliability Engineer (Cloud) to join their Product Engineering team and help build and maintain their Synthetics and Cloud product lines.
Responsibilities:
- Make sure our real-time, scalable infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware across multiple locations, as well as on all major cloud vendors
- Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
- Deep-dive into diverse topics, from firewalls and IP routing to database replication strategies and build automation
- Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
- Assist with expanding our cloud deployments across the major cloud providers
- Contribute code, code reviews, and tools or patches across all kinds of existing code
- Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
- Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team
Requirements:
- 8+ years of experience in cloud-based Systems Administration, IT, and/or SRE-related projects
- Expertise in public cloud environments such as AWS, GCP, Azure, or OCI
- Strong command of containerization and orchestration using Docker and Kubernetes
- Solid programming and automation skills using Bash, Python, or Go
- Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, and Puppet
- Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk)
- Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
- Networking administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar
- A passion for documenting code, processes, and infrastructure in runbooks and wikis
- Experience with metrics and monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
- Experience creating and managing tickets with third-party vendors and owning cloud vendor partner relationships
- Familiarity with Kubernetes automation tools, specifically managing complex deployments with Helm and Helmfile
- Knowledge of scaling Kubernetes workloads and compute infrastructure
- Experience optimizing CI/CD build and deploy pipelines using GitHub Actions and Jenkins
- Exposure to PagerDuty integrations
- Knowledge of SRE, DevOps, and GitOps practices and principles