Basis Theory is a company that offers a fully programmable vault for managing payment data and compliance. The Senior Site Reliability Engineer will ensure systems are reliable and measurable, lead efforts to improve performance, and collaborate with teams to foster a metrics-first culture.
Responsibilities:
- Serve as a hands-on member of the engineering team, with a focus on reliability, performance, and observability
- Work closely with Principal Engineers and the CTO to define SLIs, SLOs, and error budgets for key systems
- Lead cost optimization efforts by improving our use of metrics vs. logs, right-sizing trace sampling, tuning ingestion/indexing, and exploring AWS-native monitoring alternatives
- Build and improve tooling for local and automated performance testing, and track benchmarks over time to identify bottlenecks
- Drive deployment safety and canary rollouts, using UAT as a testbed, and create feedback loops that automatically assess rollout success
- Lead chaos and resilience testing, including monthly tabletop exercises, failover drills, and continuous verification of redundancy assumptions
- Partner with Engineering to evolve scaling patterns (autoscaling, architectures, etc.), including taking proactive action when new features or metrics reveal risk
Requirements:
- Production experience in cloud infrastructure and observability (AWS, Terraform, Kubernetes)
- Strong systems and debugging skills across the stack (networking, services, data)
- Experience designing and monitoring SLIs/SLOs, and reducing alert noise
- Ability to write code in one or more backend languages (Go, Python, or Node.js)
- Experience with CI/CD tooling (e.g., GitHub Actions, Jenkins, Argo CD)
- Experience optimizing observability spend and tuning Datadog, Prometheus, or similar tools
- Experience with chaos engineering, progressive deployments, and auto-remediation
- Exposure to high-throughput, latency-sensitive, or globally distributed systems