GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role centers on automating and maintaining our storage infrastructure, particularly Ceph, to ensure the reliability, scalability, and performance of our systems.

Responsibilities:

Automate and maintain day-to-day operations of storage systems to support application demands
Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, and continuous integration and deployment
Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization

Requirements:

2+ years of professional experience with Ceph in a production environment, including deployment, configuration, and management of Ceph clusters and systems
2+ years of experience in site reliability engineering or a similar role
Experience working on Linux/Unix systems, with a focus on automation and operating at scale
Proficiency in Python or Bash
Experience with Ansible, Terraform, or SaltStack
Experience with Nagios-based monitoring tools, such as Icinga2
Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
Exposure to and experience working with compute platforms (e.g., OpenStack, AWS)
Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows

Site Reliability Engineer - Storage Engineer
