GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role centers on automating and maintaining our storage infrastructure, with an emphasis on Ceph, to ensure the reliability, scalability, and performance of our systems.
Responsibilities:
- Automate and maintain day-to-day operations of storage systems to support application demands
- Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
- Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
- Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, and continuous integration and deployment
- Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization
Requirements:
- 2+ years of professional experience with Ceph in a production environment, including deployment, configuration, and management of Ceph clusters and systems
- 2+ years of experience in site reliability engineering or a similar role
- Experience working on Linux/Unix systems, with a focus on automation and operating at scale
- Proficiency in Python or Bash
- Experience with Ansible, Terraform, or SaltStack
- Experience with Nagios-based monitoring tools, such as Icinga2
- Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
- Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
- Experience working with compute platforms (e.g., OpenStack, AWS)
- Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows