NeuBird AI is scaling rapidly and is seeking an SRE Manager to lead its site reliability efforts. The role involves maintaining production reliability, managing a distributed SRE team, and establishing SRE practices to ensure operational stability as the company scales.
Responsibilities:
- Lead our SRE team in maintaining production reliability for Hawkeye across AWS and Azure environments
- Establish SLOs and SLIs that matter, build monitoring and alerting that catches issues before customers do, and manage incident response processes that minimize downtime and capture lessons from failures
- Design on-call rotations that don't burn people out, run postmortems that drive real improvements, and work with engineering teams to build reliability into the product from the start
- Manage a distributed SRE team, balancing hands-on technical work with coaching and process improvement
- Coordinate across time zones, prioritize competing reliability work, and make tradeoffs between new features and operational stability
- Own our observability infrastructure (metrics, logs, traces, and alerting systems), ensuring we can debug production issues quickly and understand system behavior under load
- Implement infrastructure as code, automate toil away, establish capacity planning processes, and build the self-service tools that let engineers deploy safely without SRE gatekeeping
- Partner with security on compliance requirements, work with customer success on escalations, and represent the SRE perspective in architecture discussions
Requirements:
- 5-7 years in SRE or infrastructure engineering
- At least 2 years managing SRE teams
- Experience at a SaaS company operating production systems at scale
- Built and maintained production infrastructure on AWS and Azure
- Managed Kubernetes clusters in production
- Implemented observability systems (Prometheus, Grafana, CloudWatch, Azure Monitor, or similar)
- Understanding of distributed systems
- Ability to debug complex production issues
- Led incident response during major outages
- Established SLO frameworks
- Designed on-call rotations
- Built automation that reduces operational burden
- Comfortable with infrastructure as code (Terraform, CloudFormation)
- Experience with configuration management and CI/CD pipelines
- Understanding of capacity planning and cost optimization
- Ability to balance reliability investments with product velocity
- Ability to hire and develop SREs
- Established sustainable on-call practices
- Created a culture where learning from failures is valued over blame
- Effective communication with both technical and non-technical audiences
- Ability to know when to escalate and when to handle issues within the team