Omilia is looking for a Senior Site Reliability Engineer who approaches operational problems as engineering challenges. The role involves defining service level objectives, identifying reliability risks, and working with engineering teams to enhance platform reliability and performance.

Responsibilities:

Act as a first responder during incidents; lead root cause analysis and blameless post-mortems
Turn incident learnings into systemic improvements — better tooling, better runbooks, better architecture
Provide input and guidance to squads on troubleshooting documentation and operational runbooks, ensuring they are practical and effective for production support
Define, implement, and iterate on SLIs, SLOs, and error budgets to drive data-informed reliability decisions
Identify and measure operational toil; build software and automation to systematically reduce it
Conduct capacity planning and performance analysis to stay ahead of scaling challenges
Design and evolve observability platforms (metrics, logs, traces, dashboards) that give engineering teams genuine insight into system behaviour — not just noise
Continuously improve alert quality: reduce false positives, increase signal, and ensure every alert is actionable
Partner with development teams to embed reliability thinking into the software delivery lifecycle — from design reviews to deployment strategies
Champion practices like chaos engineering, progressive rollouts, and failure injection testing
Mentor engineers across teams on reliability principles and operational best practices
Join on-call rotations and continuously improve the on-call experience for yourself and others

Requirements:

Fluent English, ideally at a native level
Education: Bachelor's or Master's degree in Computer Science or Engineering, or equivalent practical experience
Demonstrated experience applying SRE principles: SLOs/SLIs, error budgets, toil reduction, and capacity planning
Experience building or significantly evolving observability and monitoring solutions (we use Prometheus, Grafana, and ELK, but we care more about your approach than your tool familiarity)
Experience with AWS
Linux systems administration background (RHEL/CentOS)
Hands-on experience operating services on container orchestration platforms (Kubernetes preferred)
A track record of improving the reliability of production systems at scale — through better automation, observability, and process, not just firefighting
Strong communication skills and the ability to influence engineering culture across teams
An analytical, systems-thinking mindset — you instinctively ask 'why did this fail?' and 'how do we make sure it can't?'
Infrastructure-as-code and configuration management experience (Terraform, Ansible)
Strong scripting and automation skills (Bash, Python, or Go) — you're comfortable writing the glue that keeps systems healthy and eliminates repetitive work
Networking fundamentals (TCP/IP, DNS, load balancing)
Database experience — relational (PostgreSQL, MySQL) or NoSQL (Redis)
Telephony domain knowledge (SIP, VoIP)
Familiarity with chaos engineering tools and practices
