Operate and improve our monitoring stack (Prometheus, Grafana, OpenTelemetry): instrument services to expose the right metrics, define what we watch in production, and shape alerting so teams get actionable signals without the noise.
Help define how we run incidents: clear communication, structured learning afterward, and supporting artifacts (status page, runbooks).
Work with platform and product engineers to make reliability standards practical: help teams adopt better tooling or practices when things change, and write documentation people actually use.

You have hands-on experience choosing what to measure in production: not just reading dashboards, but picking signals that reflect the customer experience.
You're comfortable with incidents and alerts, from early detection through resolution and follow-up so similar issues are less likely to recur.
You have hands-on experience with Prometheus, Grafana, OpenTelemetry, or similar, and with alert-routing tools such as PagerDuty.
You read and write code: you can follow services and pipelines across the stack and collaborate on technical details with the teams building them.
You know what good post-incident culture looks like in practice (blame-free, learning-focused, and actually used to make things better), even if your past title never mentioned reliability.
You can write clear, concise guidance that teams adopt, and you work constructively toward sound decisions.
You're driven to automate repetitive tasks and improve developer workflows.
Nice to have:
Meaningful hands-on experience as an application or backend developer: you've built things that run in production and approach observability as someone who needs it as a "user," not just the person who sets it up.
Experience building and maintaining infrastructure on AWS (EC2, EKS, S3, CloudFormation, or similar), and hands-on experience with container technologies.
Some familiarity with CI/CD pipelines or release practices, enough to have an informed opinion on what makes deployments reliable and safe.

Space, support, and autonomy for personal growth, with a direct impact on Apify's success
Full-time position in Prague (Lucerna Palace) or Brno (Titanium)
Option to work remotely
Flexible working hours (perfect for both night owls and early birds)
Nobody counts holidays as long as the work gets done
Unlimited Claude for every Apifier. We don't count tokens. Just use them well
Stock options and profit sharing
We welcome pets, kids, and bikes at the office
Epic team buildings and offsites with biking, canoeing, and other adventures
Solid education and training budget, conference tickets, internal "Eat & Learn" sessions, and the possibility to work across teams
Generous hardware budget
Free lunches every day when you're in the office
Unlimited supply of coffee, beer, and snacks
Free entry to the wonderful Prague Zoo
Free Multisport card
Ping-pong, chess, PS5, lightsabers, and a foosball league after lunch

Platform Reliability Engineer

Key skills