Pantheon powers more than 300,000 websites for organizations worldwide. The company is seeking a Principal Engineer to lead the Hosting Platform team, which is responsible for the infrastructure behind those sites and their traffic, with a focus on reliability, scalability, and performance.
Responsibilities:
- Define and drive the technical roadmap for platform reliability, scalability, and performance across hundreds of thousands of sites and billions of monthly requests
- Architect, build, and evolve the core services underpinning Pantheon’s hosting platform — including edge delivery, container orchestration, database services, and site lifecycle management
- Partner with product, security, and infrastructure teams to identify the root causes of platform issues and design iterative, high-impact solutions
- Balance immediate business needs against long-term architectural health, making principled trade-offs that keep the platform sustainable as it scales
- Provide technical leadership across the engineering organization — setting direction, reviewing designs, and raising the bar for quality, reliability, and operability
- Mentor and coach engineers at all levels, providing technical guidance and growing engineering talent across the broader organization
- Contribute to an engineering-wide culture of collaboration, blameless postmortems, and continuous improvement
Requirements:
- 10+ years of software development experience, with significant tenure building platform or infrastructure products
- 5+ years of experience designing and architecting large-scale distributed systems
- Proficiency with container orchestration (Kubernetes), web server technologies (NGINX, PHP, Node.js, or similar), and infrastructure-as-code practices
- Demonstrated ability to own and improve a production platform serving millions of concurrent users or requests
- Proficiency in Go, Python, or equivalent systems-oriented languages
- Deep expertise in distributed systems design, large-scale service architecture, and cloud-native infrastructure (Pantheon runs on Google Cloud)
- Experience building or operating CDN, edge delivery, or networking-layer systems at scale — including caching strategies, cache invalidation, and edge performance optimization
- Strong understanding of multi-tenant hosting platforms — including the SLO definition, observability, and incident response required to operate them at scale for hundreds of thousands of customer sites
- Experience operating database systems as managed services — relational and non-relational — with appreciation for the operational complexity involved
- Fluency in AI-native engineering practices, including AI coding assistants (GitHub Copilot, Cursor, Claude Code), and a track record of integrating AI tools into engineering workflows, automation, and architectural decision-making
- Experience building agentic and LLM-powered systems — including task orchestration, prompt engineering, and RAG — with the ability to prototype and ship AI features to production
- Awareness of the infrastructure requirements of AI workloads — including model serving, GPU/accelerator provisioning, inference latency optimization, and cost trade-offs at scale