Publication

AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training

Collective communication libraries are pivotal to the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations assume fixed collective communication algorithms and parameters, limiting the potential performance gains. This paper proposes AutoCCL, an automated framework that tunes collective communication configurations to significantly improve training throughput.

NSDI 2025 / April 2025
collective communication · DNN training · distributed systems · auto-tuning

Authors

Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, Zewen Jin, Youshan Miao, Cheng Li

Abstract

Collective communication libraries are pivotal to the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations assume fixed collective communication algorithms and parameters, limiting the potential performance gains. This paper proposes AutoCCL, an automated framework that tunes collective communication configurations to significantly improve training throughput.
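To make the tuning idea concrete, the sketch below brute-force searches a few real NCCL environment variables (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS) by re-running an off-the-shelf all-reduce benchmark per candidate configuration. This is a minimal illustration only, not AutoCCL's method: the candidate values, the grid search, and the use of nccl-tests' all_reduce_perf binary are all assumptions for the example, whereas AutoCCL performs automated tuning as described in the paper.

```python
# Hypothetical sketch: exhaustive search over a few NCCL tunables.
# Each configuration is benchmarked in a fresh subprocess because
# NCCL reads these environment variables at initialization time.
import itertools
import os
import subprocess

# Real NCCL tunables; the candidate values here are illustrative.
SEARCH_SPACE = {
    "NCCL_ALGO": ["Ring", "Tree"],
    "NCCL_PROTO": ["LL", "LL128", "Simple"],
    "NCCL_NTHREADS": ["128", "256", "512"],
}

def benchmark(config: dict) -> float:
    """Run an all-reduce benchmark (here: nccl-tests' all_reduce_perf,
    assumed to be available) under the given configuration and return
    the reported average bus bandwidth in GB/s."""
    env = {**os.environ, **config}
    out = subprocess.run(
        ["./all_reduce_perf", "-b", "8M", "-e", "256M", "-f", "2"],
        env=env, capture_output=True, text=True, check=True,
    )
    # nccl-tests prints a line like "# Avg bus bandwidth : <value>".
    for line in out.stdout.splitlines():
        if "Avg bus bandwidth" in line:
            return float(line.split(":")[1])
    raise RuntimeError("could not parse benchmark output")

best_bw, best_cfg = 0.0, None
keys = list(SEARCH_SPACE)
for values in itertools.product(*SEARCH_SPACE.values()):
    cfg = dict(zip(keys, values))
    bw = benchmark(cfg)
    if bw > best_bw:
        best_bw, best_cfg = bw, cfg

print(f"best config: {best_cfg} ({best_bw:.1f} GB/s)")
```

Even this naive grid search shows why automation matters: the configuration space grows multiplicatively with each tunable, and the best setting depends on message size, topology, and workload, which is what motivates an automated tuner like AutoCCL.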