Close is a bootstrapped, profitable company with a fully remote team focused on building a user-friendly CRM for small, scaling businesses. The Site Reliability Engineer will join the Infrastructure Team to maintain and enhance the platform that supports all Close systems, ensuring the stability and performance of critical applications.
Responsibilities:
- Build and maintain the platform that runs all Close systems
- Work with multi-terabyte MongoDB, PostgreSQL, and Elasticsearch clusters
- Manage telemetry systems built on Grafana’s LGTM stack and ClickHouse, processing over 130 TB per month
- Oversee multiple Kubernetes clusters running tens of thousands of pods
- Use CI/CD powered by GitHub Actions and Argo CD for quick deployments and rollbacks
- Keep systems stable and up to date, extending a 4-year record of no scheduled downtime
- Fully automate database lifecycles with Argo Workflows
- Eliminate static credentials wherever they are found
- Reduce downtime and disruption due to maintenance or disaster
- Improve the multi-region disaster recovery system
Requirements:
- Experience building modern infrastructure systems: 5+ years for Senior 1 & 2 level candidates, 8+ years for Staff level candidates
- Respected as an expert on the systems you run
- Have served as the final point of escalation in the support of mission-critical production systems
- Familiar with some of the following technologies: AWS, Terraform, Kubernetes, Ansible, MongoDB, PostgreSQL, Elasticsearch
- Strong grasp of common networking and data transfer protocols such as DNS, HTTP, TCP
- Able to speak and write in English
- Located in the USA (ET, CT, MT, PT)
Nice to have:
- Have contributed open source code related to our tech stack
- Experience maintaining very large databases
- Have been through a successful disaster response
- Experience with multi-region architectures
- Experience running MLOps systems
- Experience scaling Temporal