Hyperbolic Labs is on a mission to democratize AI through its Open-Access AI Cloud, which provides an innovative GPU marketplace and an AI inference service. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and security of its infrastructure, with a focus on defining service level objectives, incident response, and system resilience.
Responsibilities:
- Define and maintain service level objectives for job success rates
- Build robust incident response systems
- Manage capacity across the distributed GPU network
- Implement secure rollout and rollback mechanisms that keep the platform running smoothly 24/7
- Establish reliability standards that underpin customer trust in the platform
- Design monitoring and alerting systems that provide deep visibility into infrastructure
- Build automation for capacity management and resource allocation
- Lead incident response and post-mortem processes
- Work closely with engineering teams to improve system resilience
- Focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers
- Implement key management systems and build compliance frameworks
Requirements:
- Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
- Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
- Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
- Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
- Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
- Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
- Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
- Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
- Excellent problem-solving skills with the ability to debug complex distributed systems issues under pressure
- Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines
- Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
- Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
- Knowledge of multi-tenancy security patterns, container security, and runtime security tools
- Experience with chaos engineering, fault injection, and resilience testing
- Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
- Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
- Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
- Contributions to open-source reliability, observability, or security tools