Berkeley Research Group's AI Department is seeking an AI Infrastructure Engineer to lead the development of its Virtual AI Lab initiative. The role focuses on creating a virtual access layer for a high-performance AI Lab, providing secure and efficient remote capabilities for large-scale document processing.
Responsibilities:
- Design and implement a virtual access layer for the physical AI Lab infrastructure
- Build scalable remote processing capabilities supporting 100,000+ documents per day
- Create customizable, expandable interfaces for different BRG business units
- Optimize infrastructure for maximum LLM token throughput (OpenAI/Anthropic)
- Implement secure authentication and access management systems
- Ensure high availability and fault tolerance for mission-critical AI workloads
- Lead infrastructure projects from conception to production deployment
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field
- A minimum of six to eight (6-8) years of hands-on experience designing, deploying, and managing scalable cloud infrastructure
- Strong experience with Infrastructure as Code (IaC) tools and methodologies
- Experience designing, implementing, and maintaining scalable, secure, and cost-efficient cloud/on-prem solutions
- Proven ability to manage and lead projects to deliver high-quality, replicable solutions
- Proficiency in version control (Git/GitHub), modern programming languages (Python, .NET, Java, etc.), the software development life cycle (SDLC), and CI/CD practices
- Experience with API design and implementation for distributed systems
- Knowledge of GPU infrastructure and optimization for AI workloads
- Hands-on experience with AWS services, including EC2/Lambda (apps/functions), SageMaker (ML), S3 (file management), Fargate/ECS/EKS (containerization), CDK/Terraform (IaC), and Cost Explorer/Budgets (cost management)
- Experience with LLM deployment and optimization (OpenAI, Anthropic, etc.)
- Background in building AI/ML infrastructure and platforms
- Experience with virtual desktop infrastructure (VDI) or remote access solutions
- Knowledge of distributed computing and job scheduling systems
- AWS certifications (Solutions Architect, Machine Learning, or similar)
- Experience with cost management and optimization strategies in the cloud
- Familiarity with security best practices for AI systems and data handling