Cloudflare is on a mission to help build a better Internet, running one of the world’s largest networks. They are seeking talented Systems Reliability Engineers to build and operate their Edge platform, focusing on automation, scalability, and operational excellence.
Responsibilities:
- Build and operate our Edge platform running in more than 320 cities in over 120 countries
- Support our services in a 'follow the sun' model with offices in East Asia, Europe and North America
- Build tools to constantly improve service availability, performance, and operational velocity
- Leverage an array of monitoring, alerting and diagnostics tools while developing and enhancing the Cloudflare platform and its capabilities
- Own a wide portfolio of applications and services, running a tight feedback loop of developer and operator patterns
Requirements:
- Aptitude for identifying problems, owning them and working with others to solve them
- Linux systems experience
- 3 years of experience in an SRE role or a role with similar functions
- Software development skills in some programming language such as Go, Rust, or Python
- Understanding of distributed software systems and large-scale system design tradeoffs
- Intermediate knowledge of common network protocols like DNS and HTTP
- Experience with the Linux kernel and Linux software packaging
- Performance analysis and debugging
- Configuration management systems such as SaltStack, Chef, Puppet or Ansible
- Workflow automation systems such as Temporal or Apache Airflow
- Load balancing and reverse proxies such as Nginx, Varnish, HAProxy, Squid or Apache
- SQL databases
- Time series databases and monitoring tools such as OpenTSDB, Graphite, Prometheus or Grafana
- Key/Value stores
- Internetworking and BGP
- Experience with continuous / rapid release engineering
- Strong tooling and automation development experience
- Experience working in a 24/7/365 service environment
- Experience working with large-scale production distributed systems
- A history of contributing to Open Source Software