Kentik is the network intelligence platform for modern infrastructure teams, aiming to simplify complex network operations. They are seeking a Staff Site Reliability Engineer to enhance their infrastructure, focusing on building self-service components, improving service reliability, and mentoring junior team members.
Responsibilities:
- Build self-service, declarative, API-driven infrastructure components in Go and Node.js
- Contribute to our internal deployment tooling (mostly Python CLI tools) and our service orchestration platform, which is based on Envoy, Nomad, and other HashiCorp components
- Help formulate and execute our strategy for datastores such as Postgres, Kafka, and Redis (reliability, performance, overhead, capacity planning, …)
- Improve the reliability of our services, with code and testing improvements as well as internal advocacy and education
- Mentor junior team members
- Create and update technical documentation for infrastructure
- Be on the on-call escalation path for services owned by the team
Requirements:
- 8+ years of relevant experience
- Passion for building and providing amazing tools and platforms to other engineers
- Strong coding skills in Go or Python (alternatively: server-side JavaScript, Ruby, Java …)
- Significant experience with data ecosystems and tools, cloud or on-prem
- An SRE mindset and the intent to build reliable, easy-to-operate systems
- Familiarity with Temporal (or similar workflow engines) for managing workflow execution, and experience with durable execution
- Experience with Linux bare-metal hosts managed with Puppet