The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to join its Data Platform Engineering team. In this role, you will operate the systems that support data-oriented teams, design new systems, and ensure they scale to meet demand.
Responsibilities:
- Simplifying our operations by standardizing how we deploy services and how we benefit from virtualizing and containerizing our applications
- Supporting our users, removing roadblocks, and making them more productive!
- Monitoring systems and services, and optimizing performance and resource utilization
- Proactively identifying sources of instability in distributed systems and analyzing how complex systems fail from a reliability and resilience perspective
- Automating and streamlining tasks, and identifying process gaps
- Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely, we’ll help you get used to it)
- Mentoring peers in your areas of technical and operational strength
- Traveling domestically or potentially internationally 2-3 times a year for team gatherings and conferences
Requirements:
- 5+ years of experience in an SRE/Operations/DevOps or software engineering role
- Experience with running applications and services at scale
- Proficiency with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
- Comfort with open source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
- Communicative technical English
- Experience with virtualization of data and compute
- Share our values, appreciate our code of conduct, support our team norms, and work in accordance with all three
- Customer-oriented. We're here to help, not to block
- Strong English language skills and the ability to work independently as an effective part of a globally distributed team
- Comfortable working in the open
- Passionate about supporting our communities
- Experience with Kubernetes and Ceph
- Experience with operating a data platform