About this role

Oscilar is building the most advanced AI Risk Decisioning™ Platform, aiming to make the digital world safer. We are seeking an experienced Site Reliability Engineer to take ownership of reliability across our multi-region, cloud-native platform.

Responsibilities:

Architect and operate resilient cloud infrastructure (AWS, Pulumi, Kubernetes)
Lead initiatives to improve availability, latency, and performance at scale
Design and evolve our CI/CD pipelines to optimize for speed, safety, and repeatability
Define the metrics, alerts, and runbooks that form our observability backbone
Run chaos experiments and failure simulations to harden the platform
Mentor engineers and set best practices for SRE across the company

Requirements:

Proven track record as a senior SRE or Infrastructure Engineer in high-scale environments
Expert-level skills in AWS and Infrastructure as Code (Pulumi, Terraform)
Strong programming ability in Go or Python (we use Go)
Deep understanding of distributed systems (Kafka, ClickHouse) and microservices architecture
Mastery of container orchestration (Kubernetes) and production debugging
Strong sense of ownership, and the judgment to balance velocity with reliability

Sr./Staff - Infrastructure/Site Reliability Engineer (SRE)
