Pinterest is a platform that inspires creativity and planning for memorable experiences. The company is seeking a Principal Engineer to lead the modernization of its compute infrastructure, focusing on large-scale shared compute platforms and AI workloads.

Responsibilities:

Solving the challenges of replacing isolated pools of dedicated compute resources with a very large-scale shared compute platform, shifting from machine-based designs to container-based designs
Working with leads across various platforms, especially stateful and data platforms, to build the right features and migration paths that work for them
Owning and driving up utilization on the shared compute platform by designing and implementing workload stacking, optimization and bin packing, safe oversubscription, and similar techniques
Working with multiple customers with unique requirements to make sure the platform addresses their needs and is not only a viable but a desirable solution for running their workloads
Leading a group of engineers around design topics, execution, trade-offs, migration paths, observability, performance, and operability for the platform
Evolving the platform towards a multi-cloud abstraction layer to enable running workloads across multiple cloud providers
Being a role model for setting a high bar for production quality and engineering excellence in delivering a foundational technology which empowers the entire company
Working closely with partners around capacity planning, cost visibility, fungibility of virtual machine instance types, and efficiency
Putting special focus on the delivery of GPU resources through the platform, to enable and expedite AI workloads
Leveraging AI tools to increase the velocity and ease of migrations, and creating self-service solutions for the customers of the platform as needed
Helping the team apply AI to the operational aspects of running the cluster, including discovering, investigating, and root-causing issues
Expediting feature development using AI coding tools and being a thought leader on striking the right balance between speed and safety by designing safeguards and layers of defense

Requirements:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
12+ years of relevant industry experience with large-scale production distributed systems
5+ years of experience with Kubernetes in production
Experience working across SWE and SRE or Production Engineering teams to deliver robust production systems
Ability to work with cross-functional partners across multiple organizations
Passion for automation, reducing toil, and building proper tooling for getting the job done
Experience with running distributed data systems and migrating them to Kubernetes is highly preferred

Principal Engineer, Compute Platform
