Knock is a remote-first startup focused on enhancing product notifications for better user experiences. They are seeking a DevOps Engineer to join their platform team, responsible for building and maintaining core services and infrastructure, with a strong emphasis on reliability and performance.
Responsibilities:
- Adopting a Terraform-backed EKS cluster, then modernizing and maintaining it for elastic scale, reliability, performance, security, etc.
- Going deep into troubleshooting Postgres performance and queues of every shape and size, and coming out the other side with a plan for scaling another 10x to 100x
- Identifying and correcting scaling issues before they affect our customers by relying on and improving our telemetry and traces in Datadog, AWS CloudWatch, and Honeycomb. If you see a blind spot, you are comfortable getting into the codebase to fix it
- Maintaining and improving upon our >99.95% uptime track record
- Supporting our product engineering team in moving fast to deliver customer value. Improving the day-to-day developer experience through canaries, faster cycle time, blue/green deploys, etc.
- Joining on-call rotations on a schedule with the rest of the engineering team
Requirements:
- 4+ years of experience as a DevOps engineer or similar in a startup or mid-sized company working with complex systems that operate at scale
- Experience working in and on production Kubernetes clusters using infrastructure as code (we use Terraform, but others like Pulumi or CloudFormation are fine too)
- Experience working on complex AWS deployments (multi-account, complex VPC structure to support EKS, EKS experience)
- Experience operating and scaling different database technologies. We use Aurora Postgres, Mongo, and ClickHouse, so significant experience with at least one of these is a must
- Some past experience or familiarity with operating and scaling different queues and streams across SQS, Kinesis, Kafka, or similar
- Strong problem-solving skills with a focus on reliability, scalability, and performance
- Strong communication skills, with the ability to work in a fully distributed, remote-first team