Amazon is seeking a driven and talented Member of Technical Staff to join its AGI Autonomy organization, which focuses on advancing the accuracy and efficiency of Artificial General Intelligence systems. The role involves designing and maintaining the compute platform for AI research, collaborating with research scientists, and optimizing infrastructure for performance and security.

Responsibilities:

Design, build, and maintain the compute platform that powers all AI research at the SF AI Lab, managing large-scale GPU pools and ensuring optimal resource utilization
Partner directly with research scientists to understand experimental requirements and develop infrastructure solutions that accelerate research velocity
Implement and maintain robust security controls and hardening measures while enabling researcher productivity and flexibility
Modernize and scale existing infrastructure by converting manual deployments into reproducible Infrastructure as Code using AWS CDK
Optimize system performance across multiple GPU architectures, becoming an expert in extracting maximum computational efficiency
Design and implement monitoring, orchestration, and automation solutions for GPU workloads at scale
Ensure infrastructure is compliant with Amazon security standards while creatively solving for research-specific requirements
Collaborate with AWS teams to leverage and influence cloud services that support AI workloads
Build distributed systems infrastructure, including Kubernetes-based orchestration, to support multi-tenant research environments
Serve as the bridge between traditional systems engineering and ML infrastructure, bringing enterprise-grade reliability to research computing

Member of Technical Staff, ML Infra, AGI
