NeuBird AI is scaling rapidly and is seeking an SRE Manager to lead its site reliability efforts. The role involves maintaining production reliability, managing a distributed SRE team, and establishing SRE practices that keep operations stable as the company grows.

Responsibilities:

Lead our SRE team in maintaining production reliability for Hawkeye across AWS and Azure environments
Establish SLOs and SLIs that matter, build monitoring and alerting that catches issues before customers do, and manage incident response processes that minimize downtime and learn from failures
Design on-call rotations that don't burn people out, run postmortems that drive real improvements, and work with engineering teams to build reliability into the product from the start
Manage a distributed SRE team, balancing hands-on technical work with coaching and process improvement
Coordinate across time zones, prioritize competing reliability work, and make tradeoffs between new features and operational stability
Own our observability infrastructure—metrics, logs, traces, and alerting systems—ensuring we can debug production issues quickly and understand system behavior under load
Implement infrastructure as code, automate toil away, establish capacity planning processes, and build the self-service tools that let engineers deploy safely without SRE gatekeeping
Partner with security on compliance requirements, work with customer success on escalations, and represent the SRE perspective in architecture discussions

Requirements:

5-7 years in SRE or infrastructure engineering
At least 2 years managing SRE teams
Experience at a SaaS company operating production systems at scale
Built and maintained production infrastructure on AWS and Azure
Managed Kubernetes clusters in production
Implemented observability systems (Prometheus, Grafana, CloudWatch, Azure Monitor, or similar)
Understanding of distributed systems
Ability to debug complex production issues
Led incident response during major outages
Established SLO frameworks
Designed on-call rotations
Built automation that reduces operational burden
Comfortable with infrastructure as code (Terraform, CloudFormation)
Experience with configuration management and CI/CD pipelines
Understanding of capacity planning and cost optimization
Ability to balance reliability investments with product velocity
Ability to hire and develop SREs
Established sustainable on-call practices
Created a culture where learning from failures is valued over blame
Effective communication with both technical and non-technical audiences
Ability to know when to escalate and when to handle issues within the team

Site Reliability Engineering Manager
