WorkOS builds tools and services that help developers implement authentication and broader enterprise readiness. As a Database Reliability Engineer, you'll be responsible for the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure, ensuring data durability and availability as the company grows.
Responsibilities:
- Own the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure
- Analyze and implement best practices for our database clusters, including replication, connection pooling, high availability, and disaster recovery
- Build and maintain observability for database metrics (query performance, replication lag, connection saturation, storage growth) and ensure we meet our database SLOs
- Provide database expertise to product engineering teams through migration reviews, query optimization guidance, and schema design consultation
- Develop automation and self-service tooling that enables engineers to safely interact with databases without bottlenecking on the DBRE team
- Participate in on-call rotations and lead incident response for database-related production issues, performing root cause analysis and implementing permanent fixes
- Plan and manage database capacity, forecasting growth and ensuring our infrastructure can handle increased workloads
- Collaborate with SREs to roll out infrastructure changes to production environments, with a focus on minimizing risk to the data layer
- Document operational procedures, runbooks, and architectural decisions so that lessons learned become repeatable processes and, eventually, automation
- Drive improvements to backup and recovery strategies, regularly testing and validating disaster recovery procedures
Requirements:
- 5+ years of experience running PostgreSQL in production at scale, with strong knowledge of internals (WAL, MVCC, vacuum tuning, query planner, indexing, replication)
- Solid software engineering skills. You write production-quality code, not just scripts. Experience with Python, Go, Ruby, or similar languages
- Experience with infrastructure-as-code and configuration management (Terraform, Ansible, Chef, or similar)
- Strong SQL skills and the ability to review and optimize complex queries for high-throughput, low-latency environments
- Experience with database high-availability patterns: streaming replication, connection pooling (PgBouncer), failover automation (Patroni or similar)
- Familiarity with cloud database services on AWS (RDS, Aurora, DynamoDB, ElastiCache) or equivalent platforms
- Experience with monitoring and observability tools (Datadog, Prometheus, Grafana, or similar) applied to database workloads
- Comfort with on-call responsibilities and a track record of effective incident response
- Strong written and verbal communication skills. You document your work and share context proactively
- A proactive, ownership-driven mindset. When you see something broken, you fix it. When you see a pattern of toil, you automate it
Nice to have:
- Experience with other data stores beyond PostgreSQL (Redis, DynamoDB, ClickHouse, Elasticsearch)
- Familiarity with Ruby on Rails or Django and how ORMs interact with the database layer
- Experience with database migration tooling and blue-green or zero-downtime migration strategies
- Contributions to open-source database tooling or the PostgreSQL ecosystem
- Background in security-sensitive environments, particularly around data encryption, access controls, and compliance requirements