Many real-world graphs carry important temporal information. In spatio-temporal applications such as traffic and weather forecasting, both the spatial and the temporal signals matter.
Building on the success of graph neural networks (GNNs) in learning static graph representations, researchers have recently developed temporal graph neural networks (TGNNs) to exploit the temporal information in dynamic graphs. TGNNs achieve superior accuracy on downstream tasks such as temporal link prediction and dynamic node classification across many kinds of dynamic graphs, including social networks, traffic graphs, and knowledge graphs, significantly outperforming static GNNs and other conventional methods.
As a dynamic graph evolves, the number of events associated with each node grows. When this number becomes large, TGNNs that rely on temporal attention-based aggregation or historical neighbor sampling can no longer fully capture a node's history. To compensate for the lost history, researchers created memory-based temporal graph neural networks (M-TGNNs), which maintain a memory vector for each node that summarizes that node's history.
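Concretely, the node memory behaves like a per-node recurrent state. The minimal PyTorch sketch below, loosely in the spirit of TGN-style M-TGNNs, illustrates the idea; the class name, dimensions, and the way event messages are formed are illustrative assumptions, not DistTGL's actual implementation.

```python
import torch
import torch.nn as nn


class NodeMemory(nn.Module):
    """Per-node memory vectors, updated by a GRU cell whenever a node is involved in an event."""

    def __init__(self, num_nodes: int, mem_dim: int, msg_dim: int):
        super().__init__()
        # One memory vector per node, summarizing that node's event history so far.
        self.register_buffer("memory", torch.zeros(num_nodes, mem_dim))
        self.register_buffer("last_update", torch.zeros(num_nodes))
        self.updater = nn.GRUCell(input_size=msg_dim, hidden_size=mem_dim)

    def update(self, node_ids: torch.Tensor, messages: torch.Tensor, timestamps: torch.Tensor):
        """Fold a batch of event messages into the memory of the involved nodes."""
        new_memory = self.updater(messages, self.memory[node_ids])
        self.memory[node_ids] = new_memory.detach()  # stored memory stays outside autograd
        self.last_update[node_ids] = timestamps
        return new_memory


# Usage: three events touching nodes 5, 7, and 9, each carrying a 32-dim message.
mem = NodeMemory(num_nodes=100, mem_dim=64, msg_dim=32)
mem.update(
    node_ids=torch.tensor([5, 7, 9]),
    messages=torch.randn(3, 32),
    timestamps=torch.tensor([1.0, 2.0, 3.0]),
)
```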
Despite their success, M-TGNNs scale poorly, which makes them hard to deploy in large-scale production systems. The temporal dependencies created by the auxiliary node memory force training mini-batches to be small and scheduled in chronological order (a minimal sketch of this batching follows the list below). Applying data parallelism to M-TGNN training is particularly difficult for two reasons:
- Simply increasing the batch size loses information about the temporal dependencies between events.
- All trainers must access and maintain a unified copy of the node memory, which generates a huge amount of remote traffic in distributed systems.
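The chronological scheduling constraint can be illustrated with a small sketch: events are sorted by timestamp and split into small consecutive mini-batches that must be processed in order, because later events read memory written by earlier ones. The event layout and batch size here are illustrative.

```python
from typing import Iterator, List, Tuple

Event = Tuple[int, int, float]  # (source node, destination node, timestamp)


def chronological_batches(events: List[Event], batch_size: int) -> Iterator[List[Event]]:
    """Yield small mini-batches of events in strict time order."""
    ordered = sorted(events, key=lambda e: e[2])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


events = [(0, 1, 3.0), (1, 2, 1.0), (0, 2, 2.0), (2, 3, 4.0)]
for batch in chronological_batches(events, batch_size=2):
    # Earlier batches must finish updating node memory before later ones can start.
    print(batch)
```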
New research from the University of Southern California and AWS introduces DistTGL, an efficient and scalable solution for training M-TGNNs on distributed GPU clusters. DistTGL improves on existing M-TGNN training systems in three ways:
- Model: Augmenting the node memory with additional static node memory improves both the accuracy and the convergence rate of M-TGNNs (see the sketch after this list).
- Algorithm: A novel training algorithm addresses the accuracy loss and communication overhead that arise in distributed settings.
- System: An optimized system built around prefetching and pipelining techniques minimizes the overhead of mini-batch generation.
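As a rough illustration of the model-side change, the sketch below pairs the usual dynamic, event-updated memory with a static, learnable per-node memory and combines the two before aggregation. How DistTGL actually constructs and combines the static memory is described in the paper; the concatenation here is only an illustrative choice.

```python
import torch
import torch.nn as nn


class CombinedNodeMemory(nn.Module):
    """Dynamic (event-driven) node memory paired with a static, learnable node memory."""

    def __init__(self, num_nodes: int, dyn_dim: int, static_dim: int):
        super().__init__()
        self.register_buffer("dynamic_memory", torch.zeros(num_nodes, dyn_dim))
        self.static_memory = nn.Embedding(num_nodes, static_dim)  # learned, not event-driven

    def forward(self, node_ids: torch.Tensor) -> torch.Tensor:
        # Combine the event-driven state with the static per-node embedding.
        return torch.cat(
            [self.dynamic_memory[node_ids], self.static_memory(node_ids)], dim=-1
        )


mem = CombinedNodeMemory(num_nodes=100, dyn_dim=64, static_dim=32)
print(mem(torch.tensor([0, 5, 9])).shape)  # -> torch.Size([3, 96])
```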
DistTGL significantly improves on previous approaches in both convergence and training throughput, and it is the first effort to scale M-TGNN training to distributed GPU clusters. DistTGL is publicly available on GitHub.
The researchers present two novel parallel training strategies, epoch parallelism and memory parallelism, which exploit the distinctive properties of M-TGNN training and allow M-TGNNs to capture the same number of dependent graph events on multiple GPUs as on a single GPU. They also provide heuristic guidelines for choosing the best training configuration based on dataset and hardware characteristics.
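As a purely illustrative aid, the sketch below enumerates how a GPU budget could be factored between memory parallelism and epoch parallelism; it does not reproduce DistTGL's heuristics, which depend on dataset and hardware characteristics and are detailed in the paper.

```python
from typing import List, Tuple


def candidate_configs(num_gpus: int) -> List[Tuple[int, int]]:
    """Return all (memory_parallel, epoch_parallel) splits of the GPU budget."""
    return [
        (mem_par, num_gpus // mem_par)
        for mem_par in range(1, num_gpus + 1)
        if num_gpus % mem_par == 0
    ]


print(candidate_configs(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```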
To overlap mini-batch generation with GPU training, the researchers serialize the operations on the node memory and execute them efficiently in a separate daemon process, avoiding complicated and expensive synchronization. In experiments, DistTGL achieves more than a 10x speedup in convergence rate over the state-of-the-art single-machine method when scaling to multiple GPUs.
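The memory-daemon pattern described above can be sketched as follows; this is an illustrative toy, not DistTGL's implementation. The trainer enqueues memory operations, and a separate daemon process owns the node-memory table and applies them strictly in arrival order, so the trainer never blocks on synchronization. All names and sizes are assumptions.

```python
import multiprocessing as mp

import numpy as np

NUM_NODES, MEM_DIM = 1000, 100  # illustrative sizes


def memory_daemon(op_queue: mp.Queue, result_queue: mp.Queue) -> None:
    """Owns the node-memory table and applies queued operations one at a time."""
    node_memory = np.zeros((NUM_NODES, MEM_DIM), dtype=np.float32)
    while True:
        op = op_queue.get()
        if op is None:  # shutdown signal
            result_queue.put(node_memory)  # return the final state for inspection
            break
        kind, node_ids, values = op
        if kind == "write":
            node_memory[node_ids] = values  # overwrite with freshly computed memory
        elif kind == "read":
            result_queue.put(node_memory[node_ids].copy())


if __name__ == "__main__":
    op_q, res_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=memory_daemon, args=(op_q, res_q), daemon=True)
    worker.start()

    # Trainer side: enqueue the memory writes produced by each mini-batch and keep going;
    # the daemon serializes them, overlapping memory maintenance with GPU training.
    for step in range(5):
        node_ids = np.random.randint(0, NUM_NODES, size=8)
        new_memory = np.random.randn(8, MEM_DIM).astype(np.float32)
        op_q.put(("write", node_ids, new_memory))

    op_q.put(("read", np.arange(4), None))
    print("memory slice shape for nodes 0-3:", res_q.get().shape)

    op_q.put(None)  # shut the daemon down and collect the final memory table
    final_memory = res_q.get()
    worker.join()
```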
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 30k+ ML SubReddit, 40k+ Facebook community, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you’ll love our newsletter.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today’s evolving world that make life easier for everyone.