Capital One is creating responsible and reliable AI systems, changing banking for good. The Senior Distinguished Engineer, AI Compute will engineer and scale foundational compute capabilities for the company's AI platform, leveraging expertise in distributed systems and machine learning.
Responsibilities:
- Architect and build the control and data plane implementations required to realize a highly available, multi-tenant, large-scale, and secure machine learning platform
- Develop Ray and Spark distributed compute engine solutions to accelerate diverse workloads, from LLM pre-training and reinforcement learning to large-scale data processing, while maximizing compute unit economics
- Engineer systemic improvements for operational excellence, including automating KTLO (Keep The Lights On) workflows
- Direct the technical execution of a diverse project portfolio, collaborating with developers whose specialties range from distributed microservices to running large foundation models
- Work cross-functionally with product and program management disciplines, and with stakeholders and partners across Capital One, to help optimize business outcomes while driving towards strong technology solutions
- Share your passion for staying on top of tech trends, experimenting with and learning new technologies, participating in internal and external technology communities, and leading system design and code review sessions
- Help elevate the Capital One Distinguished Engineering community and establish yourself as a go-to resource on given technologies and technology-enabled capabilities
- Lead the way in creating next-generation talent, mentoring internal talent and actively recruiting external talent to bolster the Capital One tech talent pool
Requirements:
- Bachelor's Degree
- At least 7 years of experience with application architecture and design patterns
- At least 5 years of experience with distributed databases, microservice architectures, and high availability systems
- Degree in Computer Science or a Master's Degree in Software Engineering
- Hands-on experience with the internals of Ray (Actors/GCS/Scheduling) or Spark (Query Optimizer/Memory Management)
- Experience building platforms that support LLM training, fine-tuning, or high-throughput inference
- Hands-on experience with AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and cost-optimization strategies