Pinterest is a platform that inspires creativity and helps people plan memorable experiences. The company is seeking a Principal Engineer to lead the modernization of its compute infrastructure, focusing on large-scale shared compute platforms and AI workloads.
Responsibilities:
- Solving the challenges of replacing isolated pools of dedicated compute resources with a very large-scale shared compute platform, shifting from machine-based designs to container-based designs
- Working with leads across various platforms, especially stateful and data platforms, to build the right features and migration paths that work for them
- Owning and driving up utilization on the shared compute platform by designing and implementing workload stacking, optimization and bin packing, safe oversubscription, etc.
- Working with multiple customers with unique requirements to make sure the platform addresses their needs and is not only a viable but also a desirable solution for running their workloads
- Leading a group of engineers around design topics, execution, trade-offs, migration paths, observability, performance, and operability for the platform
- Evolving the platform towards a multi-cloud abstraction layer to enable running workloads across multiple cloud providers
- Being a role model for setting a high bar for production quality and engineering excellence in delivering a foundational technology that empowers the entire company
- Working closely with partners around capacity planning, cost visibility, fungibility of virtual machine instance types, and efficiency
- Putting special focus on the delivery of GPU resources through the platform, to enable and expedite AI workloads
- Leveraging AI tools to increase the velocity and ease of migrations, and creating self-service solutions for the customers of the platform as needed
- Helping the team apply AI to the operational aspects of running the cluster, discovering issues, and investigating and root-causing them
- Expediting feature development using AI coding tools and being a thought leader on striking the right balance between speed and safety by designing safeguards and layers of defense
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- 12+ years of relevant industry experience with large-scale production distributed systems
- 5+ years of experience with Kubernetes in production
- Experience working across SWE and SRE or Production Engineering teams to deliver robust production systems
- Ability to work with cross-functional partners across multiple organizations
- Passion for automation, reducing toil, and building proper tooling for getting the job done
- Experience with running distributed data systems and migrating them to Kubernetes is highly preferred