Hyperbolic Labs is on a mission to democratize AI through its Open-Access AI Cloud, which provides an innovative GPU marketplace and AI inference service. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and security of its infrastructure, with a focus on defining service level objectives, incident response, and system resilience.

Responsibilities:

Define and maintain service level objectives for job success rates
Build robust incident response systems
Manage capacity across the distributed GPU network
Implement secure rollout and rollback mechanisms that keep the platform running smoothly 24/7
Establish reliability standards that define customer trust in the platform
Design monitoring and alerting systems that provide deep visibility into infrastructure
Build automation for capacity management and resource allocation
Lead incident response and post-mortem processes
Work closely with engineering teams to improve system resilience
Focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers
Implement key management systems and build compliance frameworks

Requirements:

Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
Excellent problem-solving skills with the ability to debug complex distributed systems issues under pressure
Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines
Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
Contributions to open-source reliability, observability, or security tools
