OfficeSpace Software provides an AI workplace management platform that helps teams optimize their performance in the modern workplace. The company is seeking a Senior Site Reliability Engineer to own the performance, reliability, and cost efficiency of its production platform, drive improvements, and lead the transition to AI-assisted reliability engineering.
Responsibilities:
- Drive measurable improvements in latency, throughput, and availability across a large-scale production environment
- Own system performance—from Linux internals to Kubernetes scheduling—and eliminate bottlenecks before customers feel them
- Define and enforce SLIs, SLOs, and error budgets that balance speed, reliability, and growth
- Partner with application engineers to profile code paths, improve execution efficiency, and harden services under real load
- Lead database performance optimization across queries, indexing, replication, and workload isolation
- Design and oversee AI-assisted load testing, stress testing, and capacity planning workflows
- Guide the migration from monolithic deployments to multi-tenant Kubernetes platforms
- Reduce infrastructure spend through architectural decisions, right-sizing, and intelligent scaling strategies
- Build and supervise automation for infrastructure provisioning, configuration management, and observability
- Set clear operational standards for reliability, performance, and incident response—and raise the bar for how we run production
Requirements:
- 7+ years operating and evolving large-scale production systems
- Deep Linux systems expertise with hands-on performance tuning across CPU, memory, disk, and networking
- Strong Python skills for automation, tooling, and AI-assisted systems workflows
- Production experience with Ruby/Rails ecosystems, including Puma and Sidekiq
- Proven ability to diagnose and resolve complex database performance issues (MySQL/MariaDB or PostgreSQL)
- Advanced Kubernetes experience—workload sizing, scheduling, and multi-tenant operations
- Infrastructure-as-code mastery using Terraform and Terragrunt
- Experience with configuration management tools such as Puppet or Ansible
- Strong observability instincts across metrics, logs, and traces using tools like Prometheus, Grafana, Datadog, or ELK
- AI fluency—comfortable supervising AI agents for analysis, testing, and reporting, and validating their outputs
- A builder mindset: you move fast, take ownership, and raise standards
Additional experience:
- Scaling and refactoring monolithic applications under real production load
- Extracting databases or stateful components from monoliths
- Apache and Nginx tuning at scale
- Redis performance optimization and operational management
- CI/CD systems and GitOps workflows, including ArgoCD
- Cloud cost optimization and FinOps-aligned operational practices