Scribe is a leading Workflow AI platform used by 94% of the Fortune 500. They are seeking a Senior Database Reliability Engineer to ensure the reliability, performance, and scalability of their data tier, with significant ownership over the engineering processes and standards.
Responsibilities:
- Own database reliability across Aurora, OpenSearch, Redis, and our CDC pipeline — including schema design reviews, migration safety (locks, backfills, concurrent index builds, NOT VALID constraints), and incident response for the data tier
- Make the Django ORM a strength at scale: catch N+1 patterns in review, extend QuerySet conventions and physical schema standards, and build the CI checks and AGENTS.md scaffolding that encode those standards so they scale beyond any single reviewer
- Operate and evolve the CDC pipeline from Aurora through DMS to S3 Parquet to Snowflake – including replication slot hygiene, schema evolution safety, and automated checks that catch migrations likely to break downstream consumers before they ship
- Build and improve observability across pganalyze, CloudWatch, and Honeycomb, with Django-side instrumentation that ties slow ORM queries back to specific users, flags, and deploys
- Drive multi-AZ resilience within our single-region architecture — Aurora writer/reader placement, failover behavior, RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability
- Build self-service tooling and dashboards that give product and platform teams visibility into their own query footprint, reducing the review burden as the engineering org grows
- Contribute to onboarding and knowledge-sharing as a large incoming class of engineers joins — write docs, run internal sessions on "what your ORM query is really doing," and feed that knowledge back into AI review tooling
Requirements:
- Has deep PostgreSQL expertise in practice: reads EXPLAIN (ANALYZE, BUFFERS) fluently, understands MVCC, bloat, lock contention, and vacuum behavior, and can tune Aurora Serverless v2 for latency and throughput
- Has worked with an ORM (Django, SQLAlchemy, ActiveRecord, or similar) at production scale – can predict the SQL a query generates, spot N+1 issues on sight, and knows when joins beat batched IN queries and when they don't
- Has run CDC pipelines in production, ideally with AWS DMS — comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake, BigQuery, or Redshift
- Has hands-on experience with pganalyze (or Datadog DBM / pg_stat_statements pipelines), CloudWatch, and Honeycomb (or another high-cardinality tracing tool); comfortable with OpenTelemetry
- Has worked with OpenSearch, Redis, and at least one production message broker (SQS, RabbitMQ, or Kafka) at scale
- Writes real automation — Python, Go, or similar — and has used Terraform or comparable IaC to manage infrastructure
- Has used AI coding and review tools in a team setting: written or maintained AGENTS.md files, configured review agents, iterated on prompts
Nice to have:
- Event sourcing on Postgres, or experience with alternate CDC tooling (Debezium, Fivetran, Airbyte)
- PgBouncer or RDS Proxy at scale with Django connection handling
- Deep Honeycomb usage: SLOs, BubbleUp, Triggers, derived columns
- Snowflake from the producer side: staging, Snowpipe, external tables on Parquet
- Experience scaling data infrastructure through rapid engineering headcount growth
- SOC 2 Type II, GDPR, or similar compliance work