Tech Lead, Site Reliability Engineering – Global Traffic Platform
San Jose, California, United States of America
Full Time
6 hours ago
$208,800 - $438,000 USD
Key skills
Change Management, Communication
About this role
About the Team
The Global Traffic Infrastructure (GTI) team leverages unified platform capabilities to manage edge infrastructure outside China (both self-built and third-party), providing standardized, compliant, scalable, and cost-effective traffic infrastructure capabilities for edge services. Our vision is to build a global edge traffic infrastructure platform and become the long-term cornerstone of ByteDance’s global edge business in terms of scale, performance, and cost.
Responsibilities
- SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI and error budget framework covering critical user journeys and services.
- Release & Change Governance: Drive end-to-end release and change management across code, configuration, network, and capacity; establish standardized change reviews, canary and phased rollout strategies, rollback mechanisms, and release window governance.
- Incident Management & On-call: Build a global 24x7 follow-the-sun on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
- Postmortems & Stability Programs: Lead major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
- Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left into architectural design and development workflows.
The base salary range for this position in the selected city is $208,800 - $438,000 USD annually.