Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. They are seeking a Site Reliability Engineer II to lead technical initiatives for automating network engineering efforts and to grow their global platform infrastructure.

Responsibilities:

Taking an engineering approach in leading technical initiatives for automating network engineering efforts to guarantee the reliability of the global Elastic infrastructure
Growing our global Platform infrastructure to meet increasing scaling demands by developing and maintaining software, tooling and automation
Collaborating in an inclusive environment, focusing on operational excellence and uplifting others
Responding to major incidents and preventing repeated customer impact through prioritized problem management. Our on-call rotation uses a follow-the-sun model in which everyone participates, (mostly) during their own working hours

Requirements:

Experience operating a SaaS product in a public cloud, ideally built using Infrastructure-as-Code tooling such as Crossplane or Terraform
Operation of hardware or software routers, with some experience in BGP configuration
Competency in system / network administration, with professional skills in Linux on distributed systems at scale and Kubernetes Cilium networking
Successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability. We want to hear about your customer-first approach to solving operational problems from an SRE perspective
A background in software engineering that lets you collaborate with engineers to expertly identify, implement and deliver solutions. Experience with public cloud and managed Kubernetes services is advantageous
Passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships. Examples of working in distributed teams or working remotely are desirable
Building or operating a Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it
Familiarity with containerized services (such as Docker)
Proven experience leading and improving alerting and major incident management processes, and using metrics systems (e.g. Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues and quantify impact for audiences at varying levels of the organization
Experience diagnosing issues in, or designing and implementing solutions with, the Elastic Stack
A history of thriving and sharing in a self-organizing, globally distributed team environment
The ability to strengthen team members and bring out the best in each other by uplifting others through coaching and mentoring

Site Reliability Engineer II (Networking) - Platform
