OfficeSpace Software is an AI workplace management platform that helps teams optimize their performance in the modern workplace. The company is seeking a Senior Site Reliability Engineer to own the performance, reliability, and cost efficiency of its production platform and to lead the transition to AI-assisted reliability engineering.

Responsibilities:

Drive measurable improvements in latency, throughput, and availability across a large-scale production environment
Own system performance—from Linux internals to Kubernetes scheduling—and eliminate bottlenecks before customers feel them
Define and enforce SLIs, SLOs, and error budgets that balance speed, reliability, and growth
Partner with application engineers to profile code paths, improve execution efficiency, and harden services under real load
Lead database performance optimization across queries, indexing, replication, and workload isolation
Design and oversee AI-assisted load testing, stress testing, and capacity planning workflows
Guide the migration from monolithic deployments to multi-tenant Kubernetes platforms
Reduce infrastructure spend through architectural decisions, right-sizing, and intelligent scaling strategies
Build and supervise automation for infrastructure provisioning, configuration management, and observability
Set clear operational standards for reliability, performance, and incident response—and raise the bar for how we run production

Requirements:

7+ years operating and evolving large-scale production systems
Deep Linux systems expertise with hands-on performance tuning across CPU, memory, disk, and networking
Strong Python skills for automation, tooling, and AI-assisted systems workflows
Production experience with Ruby/Rails ecosystems, including Puma and Sidekiq
Proven ability to diagnose and resolve complex database performance issues (MySQL/MariaDB or PostgreSQL)
Advanced Kubernetes experience—workload sizing, scheduling, and multi-tenant operations
Infrastructure-as-code mastery using Terraform and Terragrunt
Experience with configuration management tools such as Puppet or Ansible
Strong observability instincts across metrics, logs, and traces using tools like Prometheus, Grafana, Datadog, or ELK
AI fluency—comfortable supervising AI agents for analysis, testing, and reporting, and validating their outputs
A builder mindset. You move fast, take ownership, and raise standards
Scaling and refactoring monolithic applications under real production load
Extracting databases or stateful components from monoliths
Apache and Nginx tuning at scale
Redis performance optimization and operational management
CI/CD systems and GitOps workflows, including ArgoCD
Cloud cost optimization and FinOps-aligned operational practices
