Distributed training splits a training task across multiple computing nodes and then aggregates the gradients and other information produced on each node to update the model. PaddlePaddle's distributed training technology originates from Baidu's business practice and has been validated in ultra-large-scale business scenarios in fields such as natural language processing, computer vision, search, and recommendation. High-performance distributed training is one of PaddlePaddle's core technical advantages. For example, on tasks such as image classification, distributed training can achieve nearly linear speedup. Take ImageNet as an example: the ImageNet-22k dataset contains 14 million images, and training it on a single GPU would be extremely time-consuming. PaddleX therefore provides distributed training interfaces, supporting both single-machine and multi-machine training. For more methods and documentation on distributed training, please refer to the Distributed Training Quick Start Tutorial.
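For context, a single-machine multi-GPU run is typically launched by listing the GPU ids in Global.device. The command below is a sketch; the Global.device value gpu:0,1,2,3 is an assumption based on PaddleX's common CLI conventions, not a quote from this tutorial, so adjust the GPU ids to match your machine.

# Single-machine, multi-GPU training (sketch; Global.device value is an assumption)
python main.py -c paddlex/configs/modules/image_classification/PP-LCNet_x1_0.yaml \
-o Global.mode=train \
-o Global.dataset_dir=./dataset/cls_flowers_examples \
-o Global.device=gpu:0,1,2,3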
Taking Image Classification Model Training as an example, multi-machine training differs from single-machine training only in that you add the Train.dist_ips parameter, which specifies the IP addresses of the machines participating in distributed training, separated by commas. A sample command is shown below.
python main.py -c paddlex/configs/modules/image_classification/PP-LCNet_x1_0.yaml \
-o Global.mode=train \
-o Global.dataset_dir=./dataset/cls_flowers_examples \
-o Train.dist_ips="xx.xx.xx.xx,xx.xx.xx.xx"
Note:
The IP address of each machine can be checked with ifconfig (Linux/macOS) or ipconfig (Windows). The first machine in the Train.dist_ips list will be trainer0, the second trainer1, and so on.
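As a quick check, the commands below print the address to place in Train.dist_ips. This is only a sketch: the interface name eth0 is an assumption and may differ on your machines.

ifconfig eth0 | grep "inet "    # Linux/macOS; replace eth0 with your network interface
hostname -I                     # alternative on many Linux distributions
ipconfig                        # Windows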