Staff Database Reliability Engineer

United States of America

Full Time

2 weeks ago

H1B Sponsor Likely

Key skills

Database Administration, PostgreSQL, MySQL, Oracle, MS SQL Server, Database Backup & Recovery, Performance Tuning, Database Security, Cloud Platforms - AWS, Cloud Platforms - Azure, Cloud Platforms - GCP, Infrastructure as Code, Terraform, CloudFormation, Kubernetes, Observability Tools - ELK stack, Database Migration, Database Scaling, Incident Management, Monitoring, Alerting, Prometheus, Grafana, Datadog, CloudWatch, Service Level Objectives, Service Level Agreements, Disaster Recovery Planning, Scripting Languages - Python, Shell scripting, Bash, PowerShell, Automation Tools - Ansible, Puppet, Chef, Networking Basics, TCP/IP, DNS, Firewall, Load Balancers, Database Connectivity, Storage, Disk Management, Linux OS, RHEL, Ubuntu, CentOS, File Systems, Process Monitoring, System Logs Analysis, Resource Limits Management, Partitioning Tools, RAID Configurations, Logical Volume Management, Log Analysis, Root Cause Analysis, Query Performance Analysis, Python, SQL, Shell, AWS, Azure, GCP, Ansible, RDS, Cosmos DB, SQL Server, Elasticsearch, ELK Stack, Logstash, Kibana, SaaS, CI/CD, Communication, Collaboration

About this role

Rackspace Technology is a multicloud solutions expert that provides end-to-end solutions across applications, data, and security. They are seeking a Staff Database Reliability Engineer who will focus on maintaining and optimizing database systems, ensuring high availability and performance in cloud environments.

Responsibilities:

Able to work from the office (Hyderabad location)
8-10+ years in DBA / Platform Engineering
Strong multi-cloud experience (Azure / AWS / GCP – at least two)
Deep HA/DR & performance tuning expertise
Automation-first mindset (Terraform, scripting, CI/CD)
Experience in SaaS/DBaaS environments preferred

Database Administration (DBA) Skills
Primary Database: PostgreSQL
Secondary Databases: MySQL, Oracle, MS SQL Server
Database Backup & Recovery: Tools and strategies for database backups and disaster recovery
Performance Tuning: Query optimization, indexing strategies, and database performance troubleshooting
Database Security: User management, roles, access control, and auditing

Cloud Infrastructure Knowledge (DBaaS)
Cloud Platforms: AWS (RDS, Aurora), Azure (Cosmos DB, SQL Database), GCP (Cloud SQL, Firestore)
Infrastructure as Code (IaC): Terraform, CloudFormation, Kubernetes
Kubernetes & Containers: Running databases in containers (e.g., on Kubernetes)
Observability Tools: ELK stack (Elasticsearch, Logstash, Kibana)
Database Migration: Migrating databases across different platforms or cloud environments
Database Scaling: Vertical and horizontal scaling techniques in cloud environments

SRE Principles (Site Reliability Engineering)
Incident Management: Handling database outages, incident response, and on-call rotations
Monitoring and Alerting: Tools like Prometheus, Grafana, Datadog, CloudWatch
Service Level Objectives (SLOs) / Service Level Agreements (SLAs): Ensuring uptime and performance targets
Disaster Recovery Planning: Ensuring high availability (HA) and disaster recovery (DR) solutions

Scripting and Automation
Scripting Languages: Python, Shell scripting, Bash, PowerShell
Automation Tools: Ansible, Puppet, Chef
Infrastructure Automation: Automating database deployment, patching, and scaling

Networking and Infrastructure
Networking Basics: TCP/IP, DNS, Firewall, Load Balancers
Database Connectivity: Connection pooling, failover strategies, and multi-region deployment
Storage and Disk Management: Understanding IOPS, latency, and throughput

OS Skills
Expertise in Linux OS (RHEL, Ubuntu, CentOS)
Understanding of file systems (ext4, XFS, etc.), permissions, and ownership (chmod, chown, ACLs)
Knowledge of process monitoring, management, and troubleshooting (ps, top, htop, kill, pkill, etc.)
Proficiency with tools like top, htop, vmstat, iostat, sar, and dstat to monitor CPU, memory, disk I/O, and network usage
Ability to analyze system logs (/var/log/, journalctl, dmesg) for troubleshooting
Understanding of resource limits (CPU, memory, disk, network) and how they impact database performance
Knowledge of partitioning tools (fdisk, parted) and file system management (mkfs, mount, umount)
Understanding of RAID configurations and Logical Volume Management (LVM) for storage scalability

Troubleshooting and Debugging
Log Analysis: Reading and analyzing database and system logs
Root Cause Analysis (RCA): Performing in-depth analysis after major incidents
Query Performance: Analyzing slow queries, deadlocks, and resource contention

Soft Skills
Communication Skills: Clear communication with stakeholders and engineering teams
Problem-Solving: Ability to troubleshoot complex database issues under pressure
Collaboration: Working closely with DevOps, Infrastructure, and Engineering teams
