Graphs, potentially extensive networks of nodes connected by edges, can be used to express and interrogate relationships between data, such as social connections, financial transactions, traffic, energy grids, and molecular interactions. As researchers collect more data and build out these graph structures, they will need faster and more efficient methods, as well as more computational power, to perform deep learning on them in the form of graph neural networks (GNNs).
Now, a new method called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves training and inference performance by addressing three key bottlenecks in computation. This dramatically reduces the runtime of GNNs on large data sets, which, for example, contain on the scale of 100 million nodes and 1 billion edges. Further, the team found that the technique scales well when computational power is added from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.
“We started looking at the challenges existing systems were experiencing when scaling next-generation machine learning techniques for graphs to really large data sets. It turned out that there was a lot of work to be done, because many of the existing systems were achieving good performance mainly on smaller data sets that fit in GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
By vast data sets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships could spell out trends or foul play. “There are nearly one billion Bitcoin transactions on the blockchain, and if we want to identify illicit activities inside such a network, then we are facing a graph of that scale,” says co-author Jie Chen, principal research scientist and manager at IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that is able to handle that kind of graph and allows processing to be as efficient as possible, because every day we want to keep up with the pace of the new data that are generated.”
Kaler and Chen’s co-authors include Jump Trading’s Nickolas Stathas MEng ’21, who developed SALIENT as part of his graduate work; former MIT-IBM Watson AI Lab fellow and MIT graduate student Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; Tao B. Schardl, MIT CSAIL research scientist; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher at the MIT-IBM Watson AI Lab.
For this problem, the team took a systems-oriented approach in developing their method, SALIENT, Kaler says. To do so, the researchers implemented what they saw as key optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building a machine-learning model. Stathas says the process is like swapping out engines to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their specific fields to speed up model training and tease out insights during inference faster. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that will then be transferred across the data link, the more critical GPU is working to train the machine-learning model or conduct inference.
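To make the idea concrete, here is a minimal sketch of that producer-consumer pattern, not SALIENT's actual code: a CPU-side thread keeps assembling mini-batches while the GPU consumes them, so neither side sits idle. The `make_batch` helper, the tensor shapes, and the bounded queue size are all illustrative placeholders.

```python
import queue
import threading

import torch


def make_batch(i: int) -> torch.Tensor:
    # Placeholder for CPU-side graph sampling and feature slicing; in a real
    # GNN pipeline this would return a sampled subgraph plus its features.
    return torch.randn(1024, 128)


def producer(q: queue.Queue, n_batches: int) -> None:
    for i in range(n_batches):
        q.put(make_batch(i))  # CPU keeps working while the GPU trains below
    q.put(None)               # sentinel: no more batches


def train_epoch(model: torch.nn.Module, device: torch.device, n_batches: int = 100) -> None:
    q: queue.Queue = queue.Queue(maxsize=4)  # bounded, so the CPU stays just ahead of the GPU
    threading.Thread(target=producer, args=(q, n_batches), daemon=True).start()
    while (batch := q.get()) is not None:
        batch = batch.to(device, non_blocking=True)  # transfer over the data link
        loss = model(batch).sum()                    # GPU compute overlaps with CPU sampling
        loss.backward()
```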
The researchers began by analyzing the performance of a commonly used machine-learning library for GNNs (PyTorch Geometric), which showed startlingly low utilization of the available GPU resources. Applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, resulting in a 1.4 to two times performance improvement relative to public benchmark codes. This fast baseline code could execute one complete pass over a large training data set through the algorithm (an epoch) in 50.4 seconds.
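For readers reproducing this kind of per-epoch measurement, one common way to time GPU work is with CUDA events; this is a generic sketch and not the instrumentation the authors used.

```python
import torch


def time_epoch(run_epoch) -> float:
    """Return the elapsed time, in seconds, of one call to run_epoch()."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    run_epoch()                    # e.g., the train_epoch sketch above
    end.record()
    torch.cuda.synchronize()       # wait for all queued GPU work to finish
    return start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds
```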
Looking for additional performance improvements, the researchers set out to examine the bottlenecks that occur at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information present at other nearby nodes in the graph; for example, in a social network graph, information from friends of friends of a user. As the number of layers in the GNN increases, the number of nodes the network has to reach out to for information can explode, exceeding the limits of a computer. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of this were too slow to keep up with the processing speed of modern GPUs. In response, they identified a combination of data structures and algorithmic optimizations that improved sampling speed, ultimately improving the sampling operation alone by about three times and bringing the per-epoch runtime from 50.4 down to 34.6 seconds. They also found that sampling, at an appropriate rate, can be performed during inference, improving overall efficiency and energy performance, a point that had been overlooked in the literature, the team notes.
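As an illustration of neighborhood sampling in one of the frameworks mentioned above, PyTorch Geometric provides a NeighborLoader whose fanout list caps how many neighbors are sampled per layer. The dataset, fanout values, and worker count below are illustrative, not SALIENT's configuration.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

# Small stand-in dataset; the experiments in the article use far larger graphs.
dataset = Planetoid(root="data/Cora", name="Cora")
data = dataset[0]

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],       # fanout: at most 15 first-hop, then 10 second-hop neighbors
    batch_size=1024,
    input_nodes=data.train_mask,  # seed nodes for each mini-batch
    num_workers=4,                # CPU-side sampling parallelism
    shuffle=True,
)

for batch in loader:
    # Each batch is a sampled subgraph: batch.x holds node features,
    # batch.edge_index the sampled edges.
    pass
```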
In previous systems, this sampling step was a multi-process approach, creating extra data and unnecessary data movement between the processes. The researchers made their SALIENT method more nimble by creating a single process with lightweight threads that kept the data on the CPU in shared memory. Further, SALIENT takes advantage of a modern processor’s cache, Stathas says, by parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime from 34.6 to 27.8 seconds.
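A simplified sketch of the feature-slicing idea follows, under the assumption of a node-feature matrix held once on the CPU: threads in a single process gather the rows for the sampled node IDs directly into a preallocated page-locked buffer, so no data is copied between processes and the result is ready for a fast transfer to the GPU. This illustrates the general technique only; it is not SALIENT's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

NUM_NODES, FEAT_DIM, NUM_WORKERS = 1_000_000, 128, 4

# Full node-feature matrix on the CPU; lightweight threads within one process
# can all read it without any copying between processes.
features = torch.randn(NUM_NODES, FEAT_DIM)


def slice_chunk(node_ids: torch.Tensor, out: torch.Tensor) -> None:
    # Gather the feature rows of one chunk of sampled nodes directly into `out`.
    torch.index_select(features, 0, node_ids, out=out)


def parallel_slice(node_ids: torch.Tensor) -> torch.Tensor:
    n = node_ids.numel()
    # Preallocated page-locked buffer, ready for an asynchronous copy to the GPU.
    out = torch.empty(n, FEAT_DIM, pin_memory=True)
    bounds = [i * n // NUM_WORKERS for i in range(NUM_WORKERS + 1)]
    with ThreadPoolExecutor(NUM_WORKERS) as pool:
        futures = [
            pool.submit(slice_chunk, node_ids[lo:hi], out[lo:hi])
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        for f in futures:
            f.result()  # propagate any worker exception
    return out


batch_feats = parallel_slice(torch.randint(0, NUM_NODES, (4096,)))
```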
The last bottleneck the researchers addressed was pipelining mini-batch data transfers between the CPU and GPU using a prefetching step, which prepares data just before it is needed. The team calculated that this would maximize bandwidth usage on the data link and bring the method up to perfect utilization; however, they only saw around 90 percent. They identified and fixed a performance bug in a popular PyTorch library that was causing unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a 16.5-second per-epoch runtime with SALIENT.
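A minimal sketch of the prefetching idea, assuming mini-batches arrive as CPU tensors: the next batch is copied to the GPU on a side CUDA stream while the current batch is still being consumed. This shows the general technique only; it is not SALIENT's implementation and does not reproduce the PyTorch library fix described above.

```python
import torch


def prefetch_to_device(batches, device):
    """Yield GPU-resident batches, overlapping each copy with the previous batch's compute."""
    copy_stream = torch.cuda.Stream(device)
    prev = None
    for cpu_batch in batches:
        pinned = cpu_batch.pin_memory()                 # page-locked so the copy can run asynchronously
        with torch.cuda.stream(copy_stream):
            gpu_batch = pinned.to(device, non_blocking=True)
        if prev is not None:
            yield prev                                  # caller computes while the next copy proceeds
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        gpu_batch.record_stream(torch.cuda.current_stream(device))
        prev = gpu_batch
    if prev is not None:
        yield prev
```

In a training loop, `for batch in prefetch_to_device(loader, device): ...` would then feed the GPU with data that is already resident by the time it is needed.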
“I think our work showed that the devil is in the details,” says Kaler. “When you pay close attention to the details that affect performance when training a graph neural network, you can resolve a huge number of performance problems. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”
SALIENT’s speed was evaluated on three standard data sets, ogbn-arxiv, ogbn-products, and ogbn-papers100M, as well as in multi-machine settings, with different levels of fanout (the amount of data the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In every configuration, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, which contains 100 million nodes and over a billion edges. Here, it was three times faster, running on one GPU, than the optimized baseline originally created for this work; with 16 GPUs, SALIENT was an additional eight times faster.
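For readers who want to experiment at a similar scale, the OGB datasets named above load directly into PyTorch Geometric objects via the `ogb` package; the snippet below uses ogbn-products, since ogbn-papers100M is far larger to download and store. This is standard OGB usage, not part of SALIENT itself.

```python
from ogb.nodeproppred import PygNodePropPredDataset

# Load one of the Open Graph Benchmark node-property-prediction datasets;
# swap in "ogbn-arxiv" or "ogbn-papers100M" for the other graphs in the study.
dataset = PygNodePropPredDataset(name="ogbn-products", root="data/ogb")
split_idx = dataset.get_idx_split()   # standard train/valid/test node splits
data = dataset[0]                     # a torch_geometric.data.Data object

print(data.num_nodes, data.edge_index.size(1))  # node count and (directed) edge count
```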
While other systems had slightly different hardware and experimental setups, so it was not always a direct comparison, SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance numbers include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. In contrast, SALIENT’s runtime with one GPU and 20 CPUs was 16.5 seconds, and it was just two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that prior work reports, our 16-GPU runtime (two seconds) is an order of magnitude faster than other numbers that have previously been reported on this dataset,” says Kaler. The researchers attributed their performance improvements, in part, to their approach of optimizing their code for a single machine before moving to a distributed setting. Stathas says the lesson here is that, for your money, “it makes more sense to use the hardware you have extremely efficiently, before you start scaling out to multiple computers,” which can provide significant savings on the cost and carbon emissions that can come with model training.
This new capability will now allow researchers to tackle and dig deeper into bigger and bigger graphs. For example, the Bitcoin network mentioned earlier contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.
“In the future, we would look at not only running this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but we also want to do more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be really interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in the sense that they possibly would correspond to the same bad actor in a financial crime. These tasks would require developing additional algorithms, and possibly neural network architectures as well.”
This research was supported by the MIT-IBM Watson AI Lab and in part by the US Air Force Research Laboratory and the US Air Force Artificial Intelligence Accelerator.