When backpropagation is used as the default learning method, training deep neural networks (which can include hundreds of layers) can be a laborious process that takes weeks. Because the backpropagation learning algorithm is inherently sequential, these models are not easy to parallelize, even though training works well on a single computing unit. In backpropagation, the gradient of each layer depends on the gradient computed in the layer after it. Because each node in a distributed system must wait for gradient information from its successor before continuing its own calculations, this sequential dependency directly produces long wait times between nodes. In addition, communication overhead can be large when nodes constantly exchange weight and gradient data.
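The following is a minimal NumPy sketch (illustrative only, not taken from the paper) of a small MLP's backward pass. It makes the dependency concrete: every layer's weight gradient needs the error term `delta` propagated from the layer after it, so the stages cannot proceed independently.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]                       # input, two hidden, output widths
Ws = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((32, sizes[0]))      # a batch of 32 examples
y = rng.standard_normal((32, sizes[-1]))     # dummy regression targets

# Forward pass: store activations for reuse in the backward pass.
acts = [x]
for W in Ws:
    acts.append(np.maximum(acts[-1] @ W, 0.0))      # ReLU layers

# Backward pass: strictly sequential from the last layer to the first.
delta = (acts[-1] - y) * (acts[-1] > 0)              # error at the output
grads = [None] * len(Ws)
for l in reversed(range(len(Ws))):
    grads[l] = acts[l].T @ delta                     # needs delta from layer l+1
    if l > 0:
        delta = (delta @ Ws[l].T) * (acts[l] > 0)    # propagate to layer l-1

for l, g in enumerate(grads):
    print(f"layer {l}: grad shape {g.shape}")
```

In a distributed setting, each of these loop iterations would sit on a different node, and every node idles until its successor delivers `delta`.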
This becomes an even bigger problem with massive neural networks, where large amounts of data must be transferred. The growing size and complexity of neural networks has propelled distributed deep learning forward in recent years. Key solutions include distributed training frameworks such as GPipe, PipeDream, and Flower, which optimize speed, usability, cost, and scale, making it possible to train huge models. Data parallelism, pipeline parallelism, and model parallelism are among the approaches these systems use to manage large-scale neural network training efficiently across many processing nodes.
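As a simple illustration of the data-parallel idea (a toy sketch, not tied to GPipe, PipeDream, or Flower): each "worker" computes gradients on its own shard of the batch, and the gradients are then averaged, as an all-reduce step would do.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4)) * 0.1
X = rng.standard_normal((64, 8))
Y = rng.standard_normal((64, 4))

def shard_grad(Xs, Ys, W):
    """Mean-squared-error gradient for one worker's data shard."""
    return Xs.T @ (Xs @ W - Ys) / len(Xs)

num_workers = 4
shards = zip(np.array_split(X, num_workers), np.array_split(Y, num_workers))
grads = [shard_grad(Xs, Ys, W) for Xs, Ys in shards]   # computed in parallel
W -= 0.01 * np.mean(grads, axis=0)                     # averaged update
print("updated W, norm:", np.linalg.norm(W))
```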
In addition to prior work on distributed backpropagation implementations, the Forward-Forward (FF) technique, introduced by Hinton, offers a different way to train neural networks. Unlike conventional deep learning algorithms, the Forward-Forward algorithm performs all of its computations locally, layer by layer. In a distributed scenario, this layer-wise training leads to a far less dependent architecture, reducing downtime, communication, and synchronization; backpropagation, by contrast, is hard to distribute because of its global, sequential error signal.
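The sketch below shows a single Forward-Forward layer update in NumPy, following the general recipe in Hinton's FF paper: the layer raises its "goodness" (sum of squared activities) on positive data and lowers it on negative data, using only quantities available at that layer. The hyperparameters and the stand-in data here are assumptions for illustration, not the setup used in the Sabanci study.

```python
import numpy as np

rng = np.random.default_rng(2)
dim_in, dim_hidden, theta, lr = 16, 32, 2.0, 0.03
W = rng.standard_normal((dim_in, dim_hidden)) * 0.1

def ff_step(X, sign, W):
    """One local update. sign=+1 for positive data, -1 for negative data."""
    H = np.maximum(X @ W, 0.0)                 # forward pass only
    goodness = (H ** 2).sum(axis=1)            # per-sample goodness
    # derivative of softplus(-sign * (goodness - theta)) w.r.t. goodness
    coef = -sign / (1.0 + np.exp(sign * (goodness - theta)))
    grad_W = X.T @ (coef[:, None] * 2.0 * H) / len(X)
    return W - lr * grad_W

X_pos = rng.standard_normal((64, dim_in)) + 0.5    # stand-in positive data
X_neg = rng.standard_normal((64, dim_in)) - 0.5    # stand-in negative data
for _ in range(100):
    W = ff_step(X_pos, +1.0, W)
    W = ff_step(X_neg, -1.0, W)

print("mean goodness, positive:", (np.maximum(X_pos @ W, 0) ** 2).sum(1).mean())
print("mean goodness, negative:", (np.maximum(X_neg @ W, 0) ** 2).sum(1).mean())
```

Because nothing in `ff_step` refers to any other layer's gradients, each layer can sit on its own device and train as soon as it receives its input activations.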
A new study from Sabanci University presents the Pipeline Forward-Forward (PFF) algorithm for training distributed neural networks with a forward-only approach. Because it does not impose backpropagation's dependencies on the system, PFF achieves higher utilization of computational units, with fewer bubbles and less idle time, which fundamentally distinguishes it from classic pipeline-parallel implementations of backpropagation. Experiments show that PFF reaches the same level of accuracy as the standard FF implementation while being four times faster.
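The scheduling toy below (an illustration of the general idea, not the authors' PFF implementation) shows why forward-only, layer-local training pipelines so cleanly: once a stage forwards a batch to its successor, it is immediately free to take the next batch, with no idle bubble spent waiting for gradients to flow back.

```python
num_stages, num_batches = 4, 6

for tick in range(num_batches + num_stages - 1):
    busy = []
    for stage in range(num_stages):
        batch = tick - stage            # batch this stage handles at this tick
        if 0 <= batch < num_batches:
            # each stage: forward the batch, do its local update, pass it on
            busy.append(f"stage {stage}: batch {batch}")
    print(f"tick {tick}: " + " | ".join(busy))
```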
Compared to an existing Distributed Forward-Forward (DFF) implementation, PFF shows even greater benefits, achieving 5% higher accuracy in 10% fewer epochs. Because PFF transmits only layer information (weights and biases), while DFF transmits all output data, the amount of data shared between layers in PFF is significantly smaller than in DFF, resulting in lower communication overhead. Beyond these results, the team hopes the study will open a new chapter in training distributed neural networks.
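A back-of-the-envelope sketch makes the communication argument concrete; the layer shape, batch size, and exchange frequency here are hypothetical, not figures from the paper.

```python
# Assumed dimensions for illustration only.
dim_in, dim_out = 784, 500
batch_size, batches_per_epoch = 128, 400
bytes_per_float = 4

params_msg = (dim_in * dim_out + dim_out) * bytes_per_float           # W and b
activations_msg = batch_size * dim_out * bytes_per_float * batches_per_epoch

print(f"parameters per exchange : {params_msg / 1e6:.1f} MB")
print(f"activations per epoch   : {activations_msg / 1e6:.1f} MB")
```

Which quantity dominates depends on how often each is sent, but sending a layer's parameters occasionally is typically far cheaper than streaming every batch's activations.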
The team also discusses several ways in which PFF could be improved:
- The current implementation of PFF exchanges parameters between layers after each epoch. The team notes that it may be worth trying this exchange after each batch, which could help fine-tune the weights and produce more accurate results, though possibly at the cost of increased communication overhead.
- Using PFF in Federated Learning: Since PFF does not share data with other nodes during model training, it can be used to establish a federated learning system in which each node contributes its data.
- Sockets were used to establish communication between nodes in the experiments carried out in this work, and transmitting data over a network adds extra communication overhead. The team suggests that a multi-GPU architecture, in which the PFF processing units are physically close to each other and share resources, could significantly reduce the time needed to train a network.
- The Forward-Forward algorithm relies heavily on the generation of negative samples, which strongly influences how the network learns. Discovering new and better ways of producing negative samples could therefore further improve system performance; a minimal sketch of one common recipe follows this list.
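The sketch below shows one common negative-sample recipe from Hinton's FF paper, used here as an assumed example rather than the method of the PFF study: the class label is embedded into the first inputs of each image, so a correct label produces a positive sample and a deliberately wrong label produces a negative one.

```python
import numpy as np

rng = np.random.default_rng(3)
num_classes = 10

def embed_label(images, labels):
    """Overwrite the first `num_classes` inputs with a one-hot label."""
    out = images.copy()
    out[:, :num_classes] = 0.0
    out[np.arange(len(images)), labels] = 1.0
    return out

images = rng.random((32, 784))                        # stand-in for MNIST
labels = rng.integers(0, num_classes, size=32)
wrong = (labels + rng.integers(1, num_classes, size=32)) % num_classes

positives = embed_label(images, labels)               # true label embedded
negatives = embed_label(images, wrong)                # wrong label embedded
print(positives.shape, negatives.shape)
```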
Review the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.