Elastic, the Search AI Company, is seeking a Senior Site Reliability Engineer to join their Platform Engineering department. The role involves designing, building, and scaling a multi-cloud platform, ensuring the reliability of Elastic's global infrastructure while fostering an inclusive and collaborative work environment.

Responsibilities:

Taking an engineering approach to leading technical initiatives that automate systems engineering work and safeguard the reliability of the global Elastic infrastructure
Growing our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, tooling and automations
Using an inclusive approach to champion an environment focused on collaboration, operational excellence, and uplifting others
Responding to major incidents and preventing repeated customer impact through prioritised problem management

Requirements:

Successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability
A customer-first approach in solving operational problems with an SRE perspective
A background in software engineering that lets you collaborate with engineers to identify, implement and deliver solutions
Experience in public cloud and managed Kubernetes services
Passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships
Examples of working in distributed teams or working remotely
Take ownership of protecting the confidentiality, integrity, and availability of organizational data and systems by following applicable privacy and security policies, standards, and procedures
Ensure that all individual contributions follow Elastic's Secure Software Development Framework (SSDF)
Proactively participate in mandatory role-based training to ensure personal technical execution consistently aligns with the highest standards of data protection, data privacy, and system resilience
You have operated a SaaS product in a public cloud ideally built using Infrastructure-as-Code tooling such as Crossplane or Terraform
You have built or operated a Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it
You have written non-trivial programs in Golang or other programming languages
You have worked with containerized services (such as Docker)
You have proven experience in leading and improving alerting and major incident management standard processes, and in using metrics systems (e.g. Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues and quantify impact for audiences at varying levels of the organization
You have experience in system administration with professional skills in Linux on distributed systems at scale
You have diagnosed, designed and implemented solutions with the Elastic Stack
You have experience thriving in a self-organizing, knowledge-sharing, globally distributed team environment
You help team members bring out the best in each other by uplifting others through coaching and mentoring

Senior Site Reliability Engineer (Resilience) - Platform Resilience
