Amazon is seeking a driven and talented Member of Technical Staff to join its AGI Autonomy organization, which focuses on advancing the accuracy and efficiency of Artificial General Intelligence systems. The role involves designing and maintaining the compute platform for AI research, collaborating with research scientists, and optimizing infrastructure for performance and security.
Responsibilities:
- Design, build, and maintain the compute platform that powers all AI research at the SF AI Lab, managing large-scale GPU pools and ensuring optimal resource utilization
- Partner directly with research scientists to understand experimental requirements and develop infrastructure solutions that accelerate research velocity
- Implement and maintain robust security controls and hardening measures while enabling researcher productivity and flexibility
- Modernize and scale existing infrastructure by converting manual deployments into reproducible Infrastructure as Code using AWS CDK
- Optimize system performance across multiple GPU architectures, developing deep expertise in extracting maximum computational efficiency
- Design and implement monitoring, orchestration, and automation solutions for GPU workloads at scale
- Ensure infrastructure complies with Amazon security standards while finding creative solutions to research-specific requirements
- Collaborate with AWS teams to leverage and influence cloud services that support AI workloads
- Build distributed systems infrastructure, including Kubernetes-based orchestration, to support multi-tenant research environments
- Serve as the bridge between traditional systems engineering and ML infrastructure, bringing enterprise-grade reliability to research computing