Scribe is seeking a Staff Database Reliability Engineer to lead the strategy, architecture, and operational excellence of its data infrastructure. The role involves owning the data tier end-to-end, ensuring efficient database interactions, and driving major infrastructure initiatives while maintaining high performance and reliability standards.
Responsibilities:
- Own the data tier end-to-end
- Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases
- Review migrations for safety at scale — locks, backfills, concurrent index builds, NOT VALID constraints
- Catch N+1 patterns and missing select_related/prefetch_related in review
- Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
- Scale review through automation, not heroics — author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge
- Lead major infrastructure initiatives:
  - Capacity planning as traffic and engineering throughput grow
  - Zero-downtime schema migrations and cutovers
  - Multi-AZ resilience within a single region — Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
  - Backups, PITR, failover testing, retention
- Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake):
  - DMS task design and tuning, replication slot hygiene on the Postgres side
  - Schema evolution as Django migrations roll through — so a column rename doesn't silently break the warehouse at 6 AM
  - Parquet layout and partitioning, reliability of the Snowflake handoff
  - Automated checks that flag migrations likely to break downstream consumers
- Drive observability across three complementary tools:
  - pganalyze — query-level performance, index advisor, schema insights; the go-to for "why is this ORM query slow"
  - CloudWatch — infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS
  - Honeycomb — high-cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows
  - Shape how the three fit together, including Django-side instrumentation and trace attributes on ORM queries
- Build tooling and guardrails:
  - Migration review automation and CI checks for risky patterns
  - Slow query pipelines fed from pganalyze
  - Self-service dashboards so teams understand their own query footprint
- Support and evolve the rest of the stack:
  - OpenSearch — index design, sharding, mapping changes, reindexing strategy, Django-side indexing pipelines
  - Redis — caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds
  - SQS + RabbitMQ — queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer backpressure, Celery behavior under load
Requirements:
- Deep PostgreSQL — EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum
- Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
- Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) — predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
- Single-region multi-AZ design — practical understanding of what it does and doesn't protect against
- Production CDC experience, ideally AWS DMS — comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
- Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, Logs Insights), and Honeycomb (or another high-cardinality tracing tool) — comfortable with OpenTelemetry and opinionated about what makes a trace useful
- Real experience making AI coding and review tools useful for a team — writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs
- OpenSearch at scale — sizing, sharding, JVM tuning, rolling upgrades, snapshots
- Production Redis — persistence tradeoffs, cluster mode, hot keys, thundering herds
- At least one production message broker (SQS, RabbitMQ, Kafka) — delivery semantics, idempotency, failure modes
- Strong automation and IaC background — real code (Python, Go, or similar) and Terraform
- Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
- Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
- Pragmatic outlook during incidents — focused on preventing the next one