Amazon Web Services (AWS) is seeking a Software Development Engineer II for the ML Infrastructure team to enhance Machine Learning technologies. The role involves building and maintaining infrastructure for monitoring performance, automating testing, and ensuring the successful delivery of ML networking software.
Responsibilities:
- Build and maintain infrastructure that monitors and reports on the functionality and performance of large-scale testing workloads across multiple GPU instance types
- Use Jenkins, internal Amazon CI/CD tools, Linux, and public AWS products to automate testing and delivery of ML networking libraries, including collective communication frameworks, network transport layers, and GPU communication libraries
- Write Python code that orchestrates large clusters and runs benchmarks and ML applications across a matrix of instance types, operating systems, and software stack versions
- Use AWS Managed Grafana and Athena to digest performance data and build dashboards that catch functional and performance regressions before they reach customers
- Build automation using LLMs to analyze test failures and surface actionable insights to developers
- Contribute to cross-team readiness for new instance type launches by delivering performance data that shapes go/no-go decisions
- Manage the complexity of infrastructure that spans many instance types, software stacks, Linux operating systems, and the latest releases, and keep it easy to evolve
- Write Python to orchestrate test workloads across large GPU clusters, and TypeScript with CDK so that all infrastructure is defined as code, reviewed, and committed to automated pipelines
- Manage shared development clusters using SLURM and AWS ParallelCluster, supporting multiple teams while keeping costs down
- Build automation that analyzes nightly test results and surfaces regressions to developers
- Write crisp designs for your projects, communicating clearly to your peers what you will build and why