Oscilar is building the most advanced AI Risk Decisioning™ Platform, aiming to make the digital world safer. We are seeking an experienced Site Reliability Engineer to take ownership of reliability across our multi-region, cloud-native platform.
Responsibilities:
- Architect and operate resilient cloud infrastructure (AWS, Pulumi, Kubernetes)
- Lead initiatives to improve availability, latency, and performance at scale
- Design and evolve our CI/CD pipelines to optimize for speed, safety, and repeatability
- Define the metrics, alerts, and runbooks that form our observability backbone
- Run chaos experiments and failure simulations to harden the platform
- Mentor engineers and set best practices for SRE across the company
Requirements:
- Proven track record as a senior SRE or Infrastructure Engineer in high-scale environments
- Expert-level skills in AWS and Infrastructure as Code (Pulumi, Terraform)
- Strong programming ability in Go or Python (we use Go)
- Deep understanding of distributed systems (Kafka, ClickHouse) and microservices architecture
- Mastery of container orchestration (Kubernetes) and production debugging
- Strong sense of ownership and the judgment to balance velocity with reliability