Capital One is creating responsible and reliable AI systems, changing banking for good. The Senior Distinguished Engineer, AI Compute will engineer and scale foundational compute capabilities for the company's AI platform, leveraging expertise in distributed systems and machine learning.

Responsibilities:

Architect and build the control and data plane implementations required to realize a highly available, multi-tenant, large-scale, and secure machine learning platform
Develop Ray and Spark distributed compute engine solutions to accelerate diverse workloads from LLM pre-training and reinforcement learning to large-scale data processing, while maximizing compute unit economics
Engineer systemic improvements for operational excellence including automating KTLO (Keep The Lights On) workflows
Direct the technical execution of a diverse project portfolio, collaborating with developers whose specialties range from distributed microservices to running large foundation models
Work cross-functionally with product and program management disciplines, and with stakeholders and partners across Capital One, to help optimize business outcomes while driving towards strong technology solutions
Share your passion for staying on top of tech trends, experimenting with and learning new technologies, participating in internal & external technology communities, and leading system design and code review sessions
Help elevate the Capital One Distinguished Engineering community and establish yourself as a go-to resource on given technologies and technology-enabled capabilities
Lead the way in creating next-generation talent, mentoring internal talent and actively recruiting external talent to bolster the Capital One tech talent pool

Requirements:

Bachelor's Degree
At least 7 years of experience with application architecture and design patterns
At least 5 years of experience with distributed databases, microservice architectures, and high availability systems
Degree in Computer Science or a Master's Degree in Software Engineering
Hands-on experience with the internals of Ray (Actors/GCS/Scheduling) or Spark (Query Optimizer/Memory Management)
Experience building platforms that support LLM training, fine-tuning, or high-throughput inference
Hands-on experience with AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and cost-optimization strategies

Senior Distinguished Engineer, AI Compute (Remote Eligible)
