Collaborate with Engineering, Platform, and Security teams to embed SRE best practices early in system design.
Lead advancements in observability, monitoring, alerting, and incident-response workflows.
Analyze platform performance and contribute to cost optimization, performance tuning, and resilience planning.
Build infrastructure and automation tooling that improves platform reliability and enhances deployment safety.
Diagnose and resolve complex production issues across distributed systems, and drive open post-incident reviews so failures translate into durable improvements.
Strengthen system consistency and author clear, concise documentation for runbooks and operational processes.
Requirements
4+ years of experience in SRE, DevOps, platform engineering, or similar production-facing roles.
Strong problem-solving and debugging skills in distributed systems, applied to maintaining high platform stability.
Eager to share operational guidelines, champion SRE practices across teams, and openly discuss what we can learn from system failures.
Excellent communication skills (English is our default language) with a genuine, collaborative approach to working across diverse engineering teams.
Strong hands-on experience with cloud environments (AWS, GCP, or similar) and proficiency with infrastructure-as-code and CI/CD pipelines.
Familiarity with Kubernetes (or another container orchestration platform), event-driven architectures, or supporting ML/AI workloads and GPU infrastructure.
Tech Stack
AWS
Cloud
Distributed Systems
Google Cloud Platform
Kubernetes
Benefits
Flexible working models with a base in vibrant Prague and the option of a hybrid setup.
Competitive benefits designed to support your well-being, growth, and work-life harmony.
5 weeks of vacation, 5 sick/personal days, and 2 extra weeks of paternity leave.
Budget for personal development, education, and language courses.
High-end tech (MacBook, external monitor, keyboard of your choice) and a MultiSport card.
Team offsites, regular meetups, and a friendly, ambitious team.