Basis Theory offers a fully programmable vault for managing payment data and compliance. The Senior Site Reliability Engineer ensures that systems are reliable and measurable, leads efforts to improve performance, and collaborates with teams to foster a metrics-first culture.

Responsibilities:

Serve as a hands-on member of engineering, with a focus on reliability, performance, and observability
Work closely with Principal Engineers and the CTO to define SLIs, SLOs, and error budgets for key systems
Lead cost optimization efforts by improving our use of metrics vs. logs, right-sizing trace sampling, tuning ingestion/indexing, and exploring AWS-native monitoring alternatives
Build and improve tooling for local and automated performance testing, and track benchmarks over time to identify bottlenecks
Drive deployment safety and canary rollouts, using UAT as a testbed, and create feedback loops that automatically assess rollout success
Lead chaos and resilience testing, including monthly tabletop exercises, failover drills, and continuous verification of redundancy assumptions
Partner with Engineering to evolve scaling patterns (autoscaling, architectures, etc.), taking proactive action when new features or metrics reveal risk

Requirements:

Production experience in cloud infrastructure and observability (AWS, Terraform, Kubernetes)
Strong systems and debugging skills across the stack (networking, services, data)
Experience designing and monitoring SLIs/SLOs, and reducing alert noise
Ability to write code in one or more backend languages (Go, Python, or Node.js)
Experience with CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD)
Experience optimizing observability spend and tuning Datadog, Prometheus, or similar tools
Experience with chaos engineering, progressive deployments, and auto-remediation
Exposure to high-throughput, latency-sensitive, or globally distributed systems
