Design, implement, and maintain high-availability, high-throughput, data- and compute-intensive critical database systems running PostgreSQL that support a growing 24x7 SaaS platform.
Define and improve database service reliability through monitoring/alerting, SLO-oriented metrics, and operational readiness.
Participate in and help drive incident response, root cause analysis, and post-incident corrective actions for database-related production events.
Partner with other technical leaders to ensure all newly introduced systems are supportable and maintainable by both development and operations.
Provide escalated technical guidance and support to other technology teams throughout the organization.
Provide on-call coverage for production support and other duties as required.
Be accountable for complying with HIPAA security policies within the database platform.
Ensure all solutions and operational activities adhere to the security and operating policies established by the organization
Own and continuously improve our Datadog database observability by building actionable dashboards, alerts, and service-level views using an observability stack (e.g., Prometheus, Grafana, New Relic, or equivalent). Familiarity with pganalyze or Percona is a plus.
Automate system maintenance tasks using Bash, PowerShell, Python, or Ansible. Manage infrastructure as code (IaC) by writing Ansible playbooks. Some exposure to Terraform is a plus.
Experience designing and writing ETL pipelines in Python is a plus.
Ability to understand and maintain PostgreSQL ecosystem components such as PgBouncer, pgBackRest, HAProxy, and repmgr is a plus.
Excellent communication and interpersonal skills.
Requirements
BS degree in Information Systems, Engineering, or equivalent experience
7-10+ years of engineering experience in database engineering, systems engineering, DevOps, and/or SRE
Experience in cloud-based compute, storage, and containerization solutions (Azure & Kubernetes preferred)
Proficiency with operating PostgreSQL in a Linux environment is a plus
Expertise with an observability/monitoring platform (e.g., Prometheus/Grafana, New Relic, Datadog, or equivalent); Datadog experience is a plus.
Experience working in Agile/DevOps environments and operating production services with ITSM practices where applicable
Tech Stack
Ansible
Azure
Cloud
ETL
Grafana
HAProxy
ITSM
Kubernetes
Linux
Postgres
Prometheus
Python
Terraform
Benefits
Employer-sponsored health, dental, vision, life, and disability insurance