Own the reliability, performance, and operability of complex, business‑critical production services and workflows.
Own complex and escalated production issues from support, and drive long‑term fixes in collaboration with engineering, including code, configuration, and architecture changes.
Proactively identify systemic risks uncovered during problem‑solving, and convert them into long‑term engineering improvements.
Lead production efficiency initiatives, and define, develop, and maintain processes, runbooks, and knowledge base integrity across multiple services or domains.
Define, build, and maintain production monitoring systems for critical services, ensuring deep visibility into system health and user experience.
Continuously improve alerting to minimize noise, and ensure every alert is actionable and backed by a well‑documented runbook with a clearly owned response.
Define and maintain SLIs/SLOs for key services, and use error budgets to guide operational and product decisions, influencing priorities where necessary.
Turn manual processes into robust automation, and champion automation patterns and tooling adoption across teams.
Own and drive the post‑mortem review process and actions arising from incident analysis, ensuring high‑quality follow‑up and measurable reliability improvements.
Collaborate with the support organization as a senior escalation point and systematically feed back knowledge, tooling enhancements, and improvement recommendations.
Collaborate with developers throughout the lifecycle of changes, from design through rollout and patch delivery, ensuring safe deployments and efficient incident mitigation.
Lead or significantly contribute to design reviews to ensure services are operable with minimal manual intervention in production (automation, safe deployments, clear runbooks, resilience patterns), and share learnings through documentation and feedback.
Mentor and coach other engineers in production engineering practices (observability, incident handling, automation, design for failure), helping to raise the operational bar across the organization.
Requirements
5–8+ years of experience in software engineering, site reliability, production engineering, or senior technical support roles operating distributed systems.
Experience with log analysis and advanced troubleshooting in complex production environments.
Strong programming experience (e.g., JavaScript, Go, TypeScript, Java, or C#).
Experience deploying and troubleshooting systems on public cloud platforms (Azure preferred).
Strong familiarity with observability tooling (e.g., Elastic, Prometheus, Grafana, OpenTelemetry).
Solid understanding of distributed systems, networking, automation, and CI/CD.
Tech Stack
Azure
Cloud
Distributed Systems
Grafana
Java
JavaScript
Prometheus
TypeScript
Go
Benefits
18 paid vacation days, plus 4 extra global VeeaMe Days for self-care and 24 paid volunteer hours annually through Veeam Cares
Private medical coverage for you and up to four dependents
Life, accident, and disability insurance with enhanced coverage
Annual flexible wellbeing allowance for physical and mental wellness
Free confidential counselling and coaching via the Employee Assistance Program (EAP), including legal and financial advice
Meal, fuel, and transportation benefits based on work arrangement
Daycare reimbursement and safe cab facility for eligible employees
Opportunities to learn and grow through on-demand libraries (LinkedIn Learning, O’Reilly), mentoring, workshops, and learning events like our annual Global Day of Learning