Fleet Observability: You monitor system health across our entire hardware and IoT fleet. You build and maintain the scripts and data pipelines that let us identify anomalies and safeguard revenue-critical performance at scale.
Durable Problem Solving: You design automated processes to fix the root causes of system failures, reducing manual work and improving reliability across our global fleet.
Observability & Tooling: You are passionate about automating manual intervention and eliminating operational toil. To do this, you build and refine observability tools and dashboards that turn raw fleet telemetry into actionable, real-time insights.
Defining Platform Capabilities: You identify necessary platform features based on recurring operational signals. You act as the technical translator, structuring field-level pain into clear technical requirements for our engineering teams.
Cross-Functional: You spend approximately 40% of your time on deep-dive operational triage, root cause analysis, and incident response. The remaining 60% is dedicated to coding internal tools, writing automation scripts, and driving reliability projects.

Your Background: You have a degree in Computer Science, Mechatronics, Industrial Engineering, Business Informatics, Electrical Engineering or a comparable field.
Professional Experience: You have 2–4 years of experience in Technical Operations, System Performance Engineering, or Automation Engineering, preferably within complex IoT or distributed systems environments.
Strengths and Interests: You possess a data-driven mindset and the spatial awareness needed to translate diverse physical edge environments (cameras, scanners, payment terminals) into digital performance insights. You are passionate about root cause analysis and have a relentless focus on maintaining high reliability and accuracy standards across a growing fleet.
Technical Skills: You have advanced knowledge of Python and SQL, which you use to automate workflows and extract deep system insights. Ideally, you are also familiar with observability and monitoring frameworks (e.g., ELK, Prometheus, or Grafana) and data analysis tools such as Metabase.
Your Working Style: You take a high degree of ownership and think in terms of solutions, for example when designing automated recovery paths or performing deep-dive root cause analysis on fleet-wide anomalies.
Language Skills: You are a strong communicator in English (at least C1) and ideally also speak German.

Flexibility: With our hybrid work model, you can tailor your work schedule individually and spend time with your team in the office on our Anchor Days (Tuesday through Thursday).
Workation: Work from inspiring locations during your workation and return with fresh ideas.
Mobility subsidy: You have the choice between bike leasing or a travel allowance.
Measurable goals: Our OKRs allow you to directly measure your impact on our product and company success.
Events: Celebrate our successes at our legendary team events and OKR parties.
Catering: Enjoy fresh coffee from our portafilter machine around the clock. Discover the variety of Bella&Bona, our online cafeteria, help yourself to our fruit basket, or enjoy breakfast at the cereal bar.
Health: Stay fit and work out with EGYM Wellpass or Urban Sports Club in over a thousand sports and health facilities throughout Germany.
Equipment: Choose your own equipment so you can work efficiently and comfortably.
Dress code: Dress in whatever way makes you feel most comfortable.

Reliability & Process Improvement Manager
