The Hartford is an insurance company committed to making a difference and helping others achieve their goals. They are seeking a Cloud Reliability Engineering Lead to drive the reliability, scalability, and performance of API Hosting Platforms across multiple cloud providers, while building a team to ensure secure and continuously available Cloud API Platforms.
Responsibilities:
- Lead the design and implementation of reliability strategies across the API hosting platform, including availability, performance, capacity planning, and operational readiness
- Define and enforce reliability standards, SLIs/SLOs, and error budgets for platform services and customer-facing APIs
- Oversee incident management, ensuring strong triage, root-cause analysis, and preventive action development for platform issues
- Drive automation to reduce manual operations, improve deployment safety, and strengthen platform security baselines
- Establish and maintain robust observability practices, including logging, metrics, tracing, and synthetic monitoring
- Build and lead a team of reliability engineers, providing mentorship, coaching, and technical direction
- Work with application owners to prioritize reliability-focused backlog items and improve platform health over time
- Identify and implement cost-saving opportunities
- Serve as a subject-matter expert for reliability engineering best practices across the organization
- Collaborate with security teams to ensure platform compliance with enterprise security standards
- Integrate security practices into CI/CD workflows and platform architecture
- Participate in risk assessments, audits, and compliance reviews for API platform services
- Advocate for modern reliability practices (e.g., chaos engineering, resilience testing, auto-remediation)
- Evaluate and introduce new technologies, tooling, and methodologies to keep platform operations modern and efficient
- Monitor industry trends and translate them into actionable platform improvements
Requirements:
- 8+ years of technical experience in engineering, platform management, and operations roles, with a demonstrated track record of technical innovation and experience leading technically diverse teams
- Strong cloud engineering mindset, with experience across public cloud providers and the technologies most commonly used to engineer and manage highly reliable, automated environments
- Strong experience with API management or hosting platforms (e.g., Apigee, AWS API Gateway)
- Expertise with cloud-native technologies (Kubernetes, containers, distributed systems)
- Deep knowledge of performance and observability tools such as Dynatrace, Splunk, CloudWatch, and CloudTrail
- Proven track record leading engineering teams or technical initiatives
- Strong understanding of CI/CD, release automation, and DevOps tooling
- Excellent communication, stakeholder management, and problem-solving skills
- Knowledge of networking fundamentals, API security, and Zero Trust principles
- Experience with incident command roles in major incident processes
- Strong knowledge of and experience with cloud product management, cloud engineering, and Agile principles
- Strong experience with automation tools such as Ansible and Terraform
- Exceptional critical thinking and problem-solving skills
- Able to influence diverse teams and build strong business relationships