Zillow is reimagining how people move through the real estate market and is seeking a Senior Engineering Manager for Site Reliability Engineering to lead a multidisciplinary team responsible for the infrastructure and reliability of Follow Up Boss. The role involves driving cross-org alignment, modernizing infrastructure, and enhancing developer experience while ensuring high availability and performance.
Responsibilities:
- Own execution for the FUB infra & security roadmap, turning strategic goals (e.g., DB scalability, ZGCP adoption, infra cost and reliability targets) into a sequenced, realistic plan with clear milestones and measures of success
- Run an exemplary planning and delivery rhythm (quarterly), including estimation, risk management, dependency mapping, and stakeholder updates across FUB+ and central platform teams
- Ensure the team hits commitments with rare surprises, and when risk emerges, proactively engage partners to adjust scope, resources, or timeline with clear communication and tradeoffs
- Be accountable for reliability, performance, operability, and cost of core FUB services and infrastructure (EC2, RDS/Aurora, Redis/Valkey, networking, queues, SRE tooling)
- Lead the team to run a proud, low-toil on-call process: well-defined SLOs and error budgets, actionable alerting, fast incident detection/response, high-quality RCAs, and follow-through on remediation work
- Drive urgent, sustained progress on database scaling and performance, including capacity management, query and schema optimization, and modernization of data infrastructure
- Lead the FUB modernization strategy and execution for prioritized workloads (e.g., workers, supporting services), balancing devex wins, reliability, and risk while coordinating with central teams
- Partner with principal/staff engineers to refine FUB’s service scaling strategy, ensuring clear guidance on when teams build in the monolith vs. new services, and how infra supports these choices
- Raise the bar on developer environments and onboarding, reducing friction from dev boxes, tooling setup, and infra access; ensure new engineers can be productive quickly with reliable, self-service workflows
- Drive faster, safer deployments by improving CI/CD (GitLab, pipelines, AMI replacements, canary/progressive delivery) and aligning with ZG best practices for trunk-based development and feature flags
- Partner with product SDMs and tech leads to lower operational friction for dev teams (e.g., better runbooks, improved observability, easier infra integrations, automated guardrails, and guardrail-powered AI tooling)
- Lead and grow a high-performing, inclusive SRE/infrastructure/security team: set clear expectations, provide candid feedback, and manage performance
- Develop technical leaders within and adjacent to the team (SREs, SDEs, security engineers, P5 ICs) through sponsorship, delegation, and stretch opportunities that expand impact beyond the immediate team
- Hire, retain, and onboard talent across SRE and infra SDE roles, ensuring skills match the breadth of FUB infra (AWS, Terraform/Ansible, Kubernetes/ZGCP, observability, security, databases)
- Be the primary technical and operational interface for FUB infra with FUB+ leadership and central Zillow platform orgs, driving alignment on priorities, tradeoffs, and architectural decisions
- Contribute materially to FUB+ tech vision and infra strategy, especially around service scaling, platform adoption, and our long-term operations model (e.g., SRE ownership boundaries, infra/security shared services, cost posture)
- Help identify and resolve cross-org misalignment (e.g., ownership boundaries, duplicated infra work, conflicting platform choices) and advocate for solutions that maximize Zillow-wide value, not just local optimization
- Champion innovation that improves reliability, scalability, cost, and devex for multiple teams, including adoption of ZG-standard tooling and patterns and infra-focused AI agents for automation, diagnostics, and operations
- Normalize AI usage within the infra team (e.g., code generation, runbook drafting, incident summarization, capacity modeling) and share successful patterns more broadly across FUB+ and platform partners
- Partner with security (ZG and FUB) to ensure infra and application environments meet audit, SOC2, SOX, privacy, and app-sec requirements, with clear ownership for remediation work and sustainable controls
- Forecast and manage runtime and infra costs (compute, storage, observability, networking), using tagging, dashboards, and guardrails to keep costs within budget while supporting growth
Requirements:
- Proven track record as a Senior Engineering Manager or equivalent leading SRE, platform, or infrastructure teams supporting high-availability SaaS products
- Experience scaling production systems and databases in a cloud environment (ideally AWS) and leading meaningful improvements to reliability, performance, and cost
- Demonstrated ability to shift a team from reactive work to proactive, roadmap-driven execution, including setting strategy, defining metrics, and driving sustained progress across multiple quarters
- Strong background in developer experience and CI/CD, with hands-on familiarity with tools such as Terraform/Ansible, GitLab, Kubernetes/ZGCP, and modern observability stacks
- Experience partnering with security, database, networking, and central platform teams in a multi-org environment; able to navigate ambiguity and complex stakeholder landscapes
- Demonstrated people leadership as a Senior Engineering Manager: managing senior engineers, handling performance issues with limited support, building inclusive culture, and developing leaders who can operate autonomously
- Comfortable experimenting with and operationalizing AI tools in engineering workflows; curiosity and a learning mindset around emerging platform and infra capabilities
- Strong experience with scaling large LAMP / web applications
- SaaS / Sales CRM experience is a plus