PROJECT 3

Sponsored Research Project: Hardware-Efficient Serving by Layer-wise Adaptive Batching

Project Leader: Seo Jin Park, Assistant Professor, Computer Science Department

Website: https://seojinpark.net/

Abstract: Today’s large models are distributed across multiple GPUs to leverage greater aggregate memory capacity and to speed up inference and training. However, distributing and parallelizing these models introduces communication overheads that hurt GPU efficiency in several ways. For instance, Mixture-of-Experts (MoE) models that use expert parallelism require all-to-all communication before and after each expert layer. This token shuffling stalls GPUs and forces expert layers to execute with inefficient batch sizes. To address this issue, our approach employs layer-wise disaggregation and queuing, which lets individual GPUs accumulate samples freely and execute each layer at a more efficient batch size.
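As a minimal sketch of the layer-wise queuing idea (all names, thresholds, and the layer_fn callback below are illustrative assumptions, not the project's actual implementation), each expert layer could own a queue that accumulates incoming tokens and fires only when the batch is large enough to use the GPU efficiently, or when a latency deadline expires:

```python
from collections import deque
import time

class LayerQueue:
    """Hypothetical per-layer queue: accumulate tokens, execute in efficient batches."""

    def __init__(self, layer_fn, efficient_batch=64, max_wait_s=0.005):
        self.layer_fn = layer_fn                 # the expert layer's forward pass
        self.efficient_batch = efficient_batch   # batch size that saturates the GPU (assumed)
        self.max_wait_s = max_wait_s             # latency bound: flush even if underfull
        self.queue = deque()
        self.oldest_arrival = None

    def enqueue(self, token):
        # Track when the oldest queued token arrived, for the latency deadline.
        if not self.queue:
            self.oldest_arrival = time.monotonic()
        self.queue.append(token)

    def maybe_execute(self):
        """Run the layer only when the batch is efficient or the deadline is hit."""
        if not self.queue:
            return None
        waited = time.monotonic() - self.oldest_arrival
        if len(self.queue) >= self.efficient_batch or waited >= self.max_wait_s:
            batch = [self.queue.popleft() for _ in range(len(self.queue))]
            self.oldest_arrival = None
            return self.layer_fn(batch)          # one large, efficient execution
        return None                              # keep accumulating for now

# Illustrative usage with a dummy layer:
expert = LayerQueue(layer_fn=lambda batch: [t * 2 for t in batch])
for token in range(100):
    expert.enqueue(token)
    result = expert.maybe_execute()
```

The two knobs trade off against each other: the batch-size threshold preserves GPU efficiency, while the deadline bounds how long a token can wait in the queue.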

PROJECT LEADER

Seo Jin Park