Airbnb is a global hospitality company that connects hosts and guests. It is seeking a Staff Software Engineer specializing in GraphQL. The role involves driving platform reliability, designing deployment pipelines, and contributing to Viaduct, the platform that serves as a critical data access layer for Airbnb's API traffic.
Responsibilities:
- Drive platform reliability and operational excellence by designing and implementing deployment pipelines, SLO frameworks, observability tooling, performance improvements, and AI-enabled incident response automation that help maintain Viaduct's 99.99% uptime target across Airbnb's critical API traffic
- Contribute to runtime resiliency initiatives including resource attribution, performance regression testing, and proactive monitoring to ensure the multi-tenant GraphQL platform scales efficiently and degrades gracefully under load
- Architect and deliver AI-powered operational tooling that accelerates incident triage, reduces mean-time-to-mitigation, and empowers both the Viaduct team and tenant engineers with self-service debugging capabilities
- Shape the future of Viaduct Modern by contributing to the next-generation architecture, improving developer experience for hundreds of engineers, and establishing patterns that will be shared with the open-source community
- Embrace an AI-first engineering approach, using LLM-powered agents to generate and iterate on code while you focus on problem-solving, system design, and quality oversight
- Investigate and resolve complex production issues by analyzing distributed traces, resource utilization patterns, and system metrics to identify root causes and implement durable fixes
- Design and implement observability features including span instrumentation, SLO dashboards, and fine-grained attribution for blocking time, memory, and CPU across tenant workloads
- Develop and iterate on tooling for deployment triage, service health monitoring, and incident response automation using LLM capabilities
- Lead technical design discussions and RFCs for initiatives like performance regression testing pipelines, emergency deployment workflows, and runtime resiliency improvements
- Partner with tenant teams to debug performance issues, provide guidance on GraphQL best practices, and enable self-service capabilities for common operational tasks
- Contribute to open-source Viaduct by ensuring platform improvements are generalizable and well-documented for the broader engineering community
Requirements:
- 9+ years of software engineering experience, with significant depth in backend systems, distributed architectures, and platform engineering
- Deep expertise in observability and monitoring, including experience designing SLO frameworks, distributed tracing systems, and metrics pipelines at scale
- Proven track record in reliability engineering, with hands-on experience in incident response, root cause analysis, and building systems that maintain high availability (99.99%+)
- Strong experience with performance tuning and resource management in JVM-based systems, including profiling, garbage collection optimization, and a working understanding of concurrency models (blocking I/O, thread pools, Kotlin coroutines)
- Experience operating critical, high-traffic systems with a focus on deployment safety, automated rollbacks, and progressive delivery strategies
- Familiarity with GraphQL or similar API gateway/data access layer technologies
- Experience building developer tooling and platforms, with a product mindset focused on developer experience and self-service capabilities
- Strong leadership and communication skills with the ability to partner effectively across infrastructure and product engineering teams