HubSpot is an AI-powered customer platform that helps businesses grow by connecting marketing, sales, and service. They are seeking a Principal Software Engineer to lead the evolution of their Data Hub, focusing on building large-scale data systems and ensuring reliability and usability for data-driven demand generation.
Responsibilities:
- Own core pieces of our data lake and analytics stack (e.g., Iceberg, Spark, batch and streaming pipelines) that power demand gen, segmentation, and scoring at scale
- Design and evolve data systems that balance cost, latency, data freshness, and reliability, making explicit tradeoffs informed by concepts such as the CAP theorem, efficient partitioning, and storage layout (see the sketch after this list)
- Partner closely with PM, product analytics, and GTM leaders to shape commercially meaningful solutions: better lead scoring, funnel visibility, audience building, and campaign attribution for marketers and sales
- Help make Data Hub an AI-agent-forward platform, where curated, evergreen datasets automatically feed AI agents and reporting surfaces rather than requiring manual stitching or ad-hoc pipelines
- Own platform-scale outcomes: Influence technical direction across the Data Hub product line and shape the architecture for unified profiles, segmentation, and datasets that other teams can build on
- Be a high-leverage, hands-on builder: Write code and build systems while leading end-to-end delivery of high-impact, multi-quarter initiatives, setting standards for reliability, observability, testing, and incident response
- Lead through architecture and influence: Define reusable patterns for ingestion, transformation, quality, sync, and observability, and mentor senior engineers and tech leads
- Use AI code agents: Actively use AI-assisted development tools to speed iteration, reduce toil (e.g., scaffolding, tests, refactors), and improve code quality, while defining best practices for a human-in-the-loop approach
- Champion incremental, outcome-focused delivery: Break down big, ambiguous problems into incremental milestones that deliver value early and often, balancing long-term platform bets with clear business impact (ARR, adoption, usage, efficiency)
- Raise the bar on engineering practices: Model strong habits around documentation, design reviews, testing, and observability, and help establish reliability and data quality standards so downstream AI agents and data activation use cases can trust the data they receive
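To make the batch/streaming and freshness-versus-cost responsibilities above concrete, here is a minimal, hypothetical PySpark sketch of a micro-batch pipeline landing CRM events in an Iceberg table. The catalog, topic, table, and column names are illustrative assumptions, not HubSpot's actual stack.

```python
# Minimal sketch, assuming the Iceberg runtime jar and a Spark catalog named
# "lake" are already configured. Topic, table, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("crm-events-ingest")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)

# Read raw CRM events from Kafka and project the fields we care about.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "crm-events")
    .load()
    .select(
        F.col("key").cast("string").alias("contact_id"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("event_ts"),
    )
)

# Append into an Iceberg table. A longer trigger interval lowers compute cost
# and small-file churn at the expense of data freshness: the kind of tradeoff
# this role is expected to make explicitly.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/crm-events")
    .trigger(processingTime="5 minutes")
    .toTable("lake.analytics.crm_events")
)
query.awaitTermination()
```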
Requirements:
- Deep experience building large-scale data systems with Apache Spark and modern table formats like Apache Iceberg, including efficient partitioning, clustering, and file layout for both heavy ingestion and low-latency reads (see the sketch after this list)
- Pragmatic application of distributed systems principles and the CAP theorem to design fault-tolerant, horizontally scalable services that balance availability, consistency, latency, and cost where it matters
- Ability to turn ambiguous business goals into clear data models, contracts, and SLAs across multiple storage and compute layers (e.g., Iceberg, warehouses, logs, CRM stores)
- Ability to influence technical direction across the Data Hub product line and shape the architecture for unified profiles, segmentation, and datasets that other teams can build on
- Comfort staying hands-on: writing code and building systems while leading end-to-end delivery of high-impact, multi-quarter initiatives, and setting standards for reliability, observability, testing, and incident response
- Experience defining reusable patterns for ingestion, transformation, quality, sync, and observability, and mentoring senior engineers and tech leads
- Active use of AI-assisted development tools to speed iteration, reduce toil (e.g., scaffolding, tests, refactors), and improve code quality, along with defining best practices for a human-in-the-loop approach
- Ability to break down big, ambiguous problems into incremental milestones that deliver value early and often, balancing long-term platform bets with clear business impact (ARR, adoption, usage, efficiency)
- Strong habits around documentation, design reviews, testing, and observability, plus experience establishing reliability and data quality standards so downstream AI agents and data activation use cases can trust the data they receive
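As a purely illustrative example of the partitioning, clustering, and file-layout work called out in the first requirement, the sketch below lays out an Iceberg table for heavy ingestion and low-latency reads. All catalog, table, and column names are assumptions for the sake of the example.

```python
# Minimal sketch, assuming Spark is configured with the Iceberg SQL extensions
# and a catalog named "lake". Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-sketch").getOrCreate()

# Hidden partitioning: day-level pruning for time-range scans, plus bucketing
# on contact_id to spread heavy ingestion across writers.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.contact_events (
        contact_id BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP,
        payload    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), bucket(16, contact_id))
""")

# Cluster rows within data files so point lookups by contact touch fewer files.
spark.sql("""
    ALTER TABLE lake.analytics.contact_events
    WRITE ORDERED BY contact_id, event_ts
""")

# Periodic compaction keeps read latency low after high-volume streaming writes.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.contact_events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```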