The Hartford is an insurance company committed to making a difference and providing opportunities for growth. They are seeking a Principal Reliability Engineer to lead the strategic vision for Reliability Engineering within the Enterprise Data Services organization, ensuring the reliability and performance of data platforms and cloud infrastructure. This role involves influencing architectural direction, driving automation-first operations, and embedding reliability principles across the data product lifecycle.
Responsibilities:
- Work closely with the AVP, RE & Production Support, EDS to define the Reliability Engineering strategy for data platforms, data cloud environments, and data products
- Establish long‑term RE roadmaps, target operating models, and architectural patterns that scale with organizational growth
- Serve as the highest‑level technical escalation point for systemic reliability issues, influencing executive stakeholders and engineering leaders
- Leverage enterprise‑provided standards and building blocks to architect and evolve highly reliable, performant, and cost‑efficient cloud‑based platforms across AWS and GCP for all EDS services
- Influence and work directly with Platform Solution Architecture on new product enablement and hyperautomation (end‑to‑end blueprint automation)
- Oversee reliability controls and fail‑safe patterns for Snowflake, EMR, Hadoop/Spark clusters, container platforms (e.g., Kubernetes), and mission‑critical data systems
- Lead the creation and enforcement of SLO/SLI frameworks that span the entire data lifecycle
- Develop and implement AI‑driven automation for anomaly detection, alert correlation, autonomous remediation, and predictive capacity management
- Leverage LLMs, prompt engineering, and cloud‑native AI services (AWS Bedrock, SageMaker, Vertex AI) to build intelligent runbooks, advanced troubleshooting agents, and generative‑AI‑enabled operational tooling
- Champion the adoption of machine learning–based observability and reliability analytics
- Adopt and architect enterprise‑wide data observability frameworks—including logging, metrics, tracing, distributed profiling, and event pipelines—for all data platforms and pipelines
- Establish gold‑standard incident response patterns, post‑incident reviews, and continuous improvement processes
- Drive elimination of toil across EDS, focusing on self‑healing systems, proactive detection, and autonomous operations
- Define RE best practices for modern data products, governed data pipelines, real‑time/streaming systems, and operational analytics platforms
- Ensure data quality, data timeliness, and SLAs for data products through automated checks, lineage-informed alerting, and pipeline reliability tooling
- Partner with Data Engineering to embed resilience patterns (idempotency, checkpointing, replayability, disaster recovery) into pipeline architectures
- Set and enforce standards for IaC, CI/CD, platform automation, reliability frameworks, operational readiness, and runbook quality across EDS
- Provide technical leadership and mentorship to Staff/Senior Engineers in the RE team and Production Support teams, influencing engineering culture and helping grow RE capabilities across the organization
- Represent Reliability Engineering in architectural reviews, enterprise governance forums, and executive‑level discussions
Requirements:
- 10+ years in one or more of the following areas: data, cloud, platform engineering, site/reliability engineering, or large-scale distributed systems, with experience in leadership or technology leader roles
- Proficiency with data or cloud platforms, including architectural patterns for resilience, networking, security, and distributed data infrastructure
- Deep experience supporting or engineering platforms such as Snowflake, EMR, Hadoop/Spark, Data Integration, and cloud-native data ecosystems
- Scripting and programming (preferably Python) for large-scale automation, platform tooling, and reliability frameworks
- Experience with Infrastructure-as-Code (Terraform, CloudFormation) and enterprise CI/CD
- Experience in regulated or highly complex enterprise environments (financial services, insurance, healthcare)
- Prior experience as a Senior Staff Engineer, a hands-on Engineering or Architecture leader, or in a similar senior technical role
- Knowledge of data governance, metadata, lineage systems, and data quality engineering practices
- Certifications in AWS, GCP, Kubernetes, or SRE/DevOps frameworks
- Background applying machine learning to operations—anomaly detection, event correlation, predictive modeling, and automated remediation
- Understanding of AI-enabled developer/operations tools that use LLMs, prompt engineering, or cloud AI services for reliability improvements
- Expertise with enterprise observability stacks (Prometheus, Grafana, Datadog, Splunk, Dynatrace, OpenTelemetry)
- Ability to design and enforce advanced SLI/SLO frameworks across complex data ecosystems
- Demonstrated ability to lead technical strategy at scale, influence senior engineering leaders, and set enterprise-wide standards
- Strong capability in mentoring engineers, providing architectural guidance, and fostering engineering excellence
- Exceptional communication skills for interacting with executives, senior architects, product leaders, and engineering teams