WorkOS builds tools and services that help developers implement authentication and broader enterprise readiness. As a Database Reliability Engineer, you'll be responsible for the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure, ensuring data durability and availability as the company grows.
Responsibilities:
- Own the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure
- Analyze and implement best practices for our database clusters, including replication, connection pooling, high availability, and disaster recovery
- Build and maintain observability for database metrics (query performance, replication lag, connection saturation, storage growth) and ensure we meet our database SLOs
- Provide database expertise to product engineering teams through migration reviews, query optimization guidance, and schema design consultation
- Develop automation and self-service tooling that enables engineers to safely interact with databases without bottlenecking on the DBRE team
- Participate in on-call rotations and lead incident response for database-related production issues, performing root cause analysis and implementing permanent fixes
- Plan and manage database capacity, forecasting growth and ensuring our infrastructure can handle increased workloads
- Collaborate with SREs to roll out infrastructure changes to production environments, with a focus on minimizing risk to the data layer
- Document operational procedures, runbooks, and architectural decisions so that lessons learned become repeatable processes and, eventually, automation
- Drive improvements to backup and recovery strategies, regularly testing and validating disaster recovery procedures
Requirements:
- 5+ years of experience running PostgreSQL in production at scale, with strong knowledge of internals (WAL, MVCC, vacuum tuning, query planner, indexing, replication)
- Solid software engineering skills. You write production-quality code, not just scripts. Experience with Python, Go, Ruby, or similar languages
- Experience with infrastructure-as-code and configuration management (Terraform, Ansible, Chef, or similar)
- Strong SQL skills and the ability to review and optimize complex queries for high-throughput, low-latency environments
- Experience with database high-availability patterns: streaming replication, connection pooling (PgBouncer), failover automation (Patroni or similar)
- Familiarity with cloud database services on AWS (RDS, Aurora, DynamoDB, ElastiCache) or equivalent platforms
- Experience with monitoring and observability tools (Datadog, Prometheus, Grafana, or similar) applied to database workloads
- Comfort with on-call responsibilities and a track record of effective incident response
- Strong written and verbal communication skills. You document your work and share context proactively
- A proactive, ownership-driven mindset. When you see something broken, you fix it. When you see a pattern of toil, you automate it
Nice to have:
- Experience with other data stores beyond PostgreSQL (Redis, DynamoDB, ClickHouse, Elasticsearch)
- Familiarity with Ruby on Rails or Django and how ORMs interact with the database layer
- Experience with database migration tooling and blue-green or zero-downtime migration strategies
- Contributions to open-source database tooling or the PostgreSQL ecosystem
- Background in security-sensitive environments, particularly around data encryption, access controls, and compliance requirements