Kentik is the network intelligence platform for modern infrastructure teams. They are seeking a Staff-level Site Reliability Engineer (Cloud) to join their Product Engineering team to help build and maintain their Synthetics and Cloud product lines.

Responsibilities:

Make sure our real-time, scalable infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware across multiple locations, as well as on all major cloud vendors
Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
Dive deep into diverse topics, from firewalls and IP routing to database replication strategies and build process automation
Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
Assist with expanding our cloud deployments across the major cloud providers
Contribute code, code reviews, tools, and patches across all kinds of existing code
Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team

Requirements:

8+ years of experience in cloud-based Systems Administration, IT, and/or SRE-related projects
Expertise in public cloud environments such as AWS, GCP, Azure, or OCI
Strong command of containerization and orchestration using Docker and Kubernetes
Solid programming and automation skills using Bash, Python, or Go
Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, and Puppet
Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk)
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
Networking administration experience: concepts such as routing, firewalls (iptables), and peering should sound familiar
A passion for documenting code, processes, and infrastructure in runbooks and wikis
Experience with metrics and monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
Experience creating and managing tickets with third-party vendors and owning cloud vendor partner relationships
Familiarity with Kubernetes automation tools, specifically managing complex deployments with Helm and Helmfile
Knowledge of scaling Kubernetes workloads and compute infrastructure
Experience optimizing CI/CD build and deploy pipelines using GitHub Actions and Jenkins
Exposure to PagerDuty integrations
Knowledge of SRE, DevOps, and GitOps practices and principles
