Role Overview

As a Site Reliability Engineer on our Platform Squad, you will play a key role in keeping Flip's infrastructure fast, resilient, and ready to scale.
You will shape the reliability culture, tools, and practices that enable our engineering teams to ship with confidence—at scale and without compromising availability.
This role is ideal for an engineer passionate about building high-throughput, highly available systems who wants to help define how a fast-growing SaaS platform operates in production.
Enable scaling: expand and optimize our cloud infrastructure on Azure and our Kubernetes clusters—designed for high throughput and maximum availability—to support Flip’s rapid global growth.
Ensure resilience & security: design and implement zero-downtime deployments, rollback mechanisms, and disaster-recovery strategies that keep our platform available around the clock.
Build observability: evolve our LGTM stack (Loki, Grafana, Tempo, Mimir) to provide every team with the visibility they need—and use it to define and optimize our SLOs.
Automate everything: design, develop, and optimize Infrastructure as Code with Pulumi in Go to eliminate manual toil and offer our platform to engineering teams as a self-service product.
Drive reliability practices: promote CI/CD best practices, incident management, post-mortems, and developer experience across the engineering organization.
Shape our roadmap: work with your squad and engineering leadership to define the platform direction—from scalable high-throughput systems and cost optimization to security posture and compliance.

Requirements

1–3 years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
Experience operating and scaling cloud infrastructure (e.g., Azure, GCP, AWS).
Deep knowledge of Kubernetes and container orchestration in production environments.
Hands-on experience with modern observability stacks (e.g., Prometheus, Mimir, Loki, ELK) and familiarity with defining and operating SLOs and error budgets.
Solid software development skills in Go (preferred, since our IaC runs on Pulumi in Go), Python, or Kotlin.
Hands-on experience with Infrastructure as Code (e.g., Pulumi, OpenTofu, Terraform) and configuration management tools (e.g., Ansible, Chef).
A collaborative mindset, strong communication skills, and business-fluent English.
Willingness to participate in on-call rotations to ensure the reliability of our platform.

Tech Stack

Ansible
AWS
Azure
Chef
Cloud
Go
Google Cloud Platform
Grafana
Kotlin
Kubernetes
Prometheus
Python
Terraform

Benefits

Work mode: We are remote-first, giving you the flexibility to work from home. At the same time, we value the benefits of in-person collaboration. Depending on the role, you will occasionally attend team events, workshops, or meetings at our offices in Berlin or Stuttgart—always with sufficient notice. The exact balance will be discussed transparently during your application process.
Work–life balance: We don’t want you to be glued to your desk, so we cover the cost of your E-Gym/Wellpass membership and offer company bike leasing (JobRad).
Celebrate successes: You’ll work with highly motivated, committed people in a relaxed work atmosphere.
Be in the action: You will actively shape Flip, help drive the rapid growth of a young tech company, and grow alongside your goals. A positive atmosphere is guaranteed.
Happy to be a Flipster: Look forward to regular team events and Culture Days that bring us together as Flipsters.
Work abroad: At Flip you can also work from other European countries—let’s talk about workation during the interview.
