Rackspace Technology is a multicloud solutions expert that provides end-to-end solutions across applications, data, and security. They are seeking a Staff Database Reliability Engineer who will focus on maintaining and optimizing database systems, ensuring high availability and performance in cloud environments.
Requirements:
- Able to work from the office (Hyderabad location)
- 8-10+ years of experience in DBA / Platform Engineering
- Strong multi-cloud experience (at least two of Azure, AWS, and GCP)
- Deep HA/DR & performance tuning expertise
- Automation-first mindset (Terraform, scripting, CI/CD)
- Experience in SaaS/DBaaS environments preferred
- Database Administration (DBA) Skills
  - Primary Database: PostgreSQL
  - Secondary Databases: MySQL, Oracle, MS SQL Server
  - Database Backup & Recovery: Tools and strategies for database backups and disaster recovery
  - Performance Tuning: Query optimization, indexing strategies, and database performance troubleshooting
  - Database Security: User management, roles, access control, and auditing
- Cloud Infrastructure Knowledge (DBaaS)
  - Cloud Platforms: AWS (RDS, Aurora), Azure (Cosmos DB, SQL Database), GCP (Cloud SQL, Firestore)
  - Infrastructure as Code (IaC): Terraform, CloudFormation, Kubernetes
  - Kubernetes & Containers: Running databases in containers and on orchestration platforms such as Kubernetes
  - Observability Tools: ELK stack (Elasticsearch, Logstash, Kibana)
  - Database Migration: Migrating databases across platforms and cloud environments
  - Database Scaling: Vertical and horizontal scaling techniques in cloud environments
- SRE Principles (Site Reliability Engineering)
  - Incident Management: Handling database outages, incident response, and on-call rotations
  - Monitoring and Alerting: Tools such as Prometheus, Grafana, Datadog, and CloudWatch
  - Service Level Objectives (SLOs) / Service Level Agreements (SLAs): Meeting uptime and performance targets
  - Disaster Recovery Planning: Designing high availability (HA) and disaster recovery (DR) solutions
- Scripting and Automation
  - Scripting Languages: Python, shell scripting (Bash), PowerShell
  - Automation Tools: Ansible, Puppet, Chef
  - Infrastructure Automation: Automating database deployment, patching, and scaling
- Networking and Infrastructure
  - Networking Basics: TCP/IP, DNS, firewalls, load balancers
  - Database Connectivity: Connection pooling, failover strategies, and multi-region deployment
  - Storage and Disk Management: Understanding IOPS, latency, and throughput
- OS Skills
  - Expertise in Linux (RHEL, Ubuntu, CentOS)
  - Understanding of file systems (ext4, XFS, etc.), permissions, and ownership (chmod, chown, ACLs)
  - Knowledge of process monitoring, management, and troubleshooting (ps, top, htop, kill, pkill, etc.)
  - Proficiency with tools such as top, htop, vmstat, iostat, sar, and dstat to monitor CPU, memory, disk I/O, and network usage
  - Ability to analyze system logs (/var/log/, journalctl, dmesg) for troubleshooting
  - Understanding of resource limits (CPU, memory, disk, network) and how they affect database performance
  - Knowledge of partitioning tools (fdisk, parted) and file system management (mkfs, mount, umount)
  - Understanding of RAID configurations and Logical Volume Management (LVM) for storage scalability
- Troubleshooting and Debugging
  - Log Analysis: Reading and analysing database and system logs
  - Root Cause Analysis (RCA): Performing in-depth analysis after major incidents
  - Query Performance: Analysing slow queries, deadlocks, and resource contention
- Soft Skills
  - Communication Skills: Clear communication with stakeholders and engineering teams
  - Problem-Solving: Ability to troubleshoot complex database issues under pressure
  - Collaboration: Working closely with DevOps, Infrastructure, and Engineering teams