GraphStorm is a low-code, enterprise graph machine learning (GML) framework for building, training, and deploying graph machine learning solutions on complex, enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly consider the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search and retrieval problems.
Today we released GraphStorm 0.3, which adds native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. It also adds new APIs for customizing GraphStorm pipelines: you now need only 12 lines of code to implement a custom node classification training loop. To help you get started with the new APIs, we have published two example Jupyter notebooks: one for node classification and one for link prediction. We also published a comprehensive study of co-training language models (LMs) and graph neural networks (GNNs) on large graphs with rich text features, using the Microsoft Academic Graph (MAG) dataset, in our KDD 2024 paper. The study demonstrates the performance and scalability of GraphStorm on text-rich graphs and describes best practices for configuring GML training loops for better performance and efficiency.
Native support for multi-task learning on graphs
Many enterprise applications have graph data associated with multiple tasks at different nodes and edges. For example, retail organizations want to perform fraud detection on both sellers and buyers. Scientific publishers want to find more related work to cite in their articles and need to select the right topic to make their publication discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.
GraphStorm 0.3 supports multi-task learning on graphs across the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You specify the training targets through a YAML configuration file. For example, a scientific publisher can use a YAML configuration to simultaneously define a paper topic classification task on paper nodes and a link prediction task on paper-citing-paper edges:
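A minimal sketch of such a configuration is shown below. The structure follows the pattern of GraphStorm's multi-task learning documentation, but the specific field values (label name, class count, task weights) are illustrative assumptions, not taken from a real MAG setup:

```yaml
---
version: 1.0
gsf:
  multi_task_learning:
    - node_classification:
        target_ntype: "paper"          # classify paper nodes by topic
        label_field: "label"           # assumed label column name
        num_classes: 10                # assumed number of topics
        task_weight: 1.0
    - link_prediction:
        train_etype:
          - "paper,citing,paper"       # predict paper-citing-paper edges
        num_negative_edges: 4
        task_weight: 0.5
```

Each task in the list carries its own `task_weight`, which controls that task's contribution to the combined loss in the shared training loop.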
For more details on how to run multi-task graph learning with GraphStorm, see Multi-task learning in GraphStorm in our documentation.
New APIs to customize GraphStorm components and pipelines
Since GraphStorm’s release in early 2023, customers have primarily used its command-line interface (CLI), which abstracts away the complexity of graph ML scripting so you can quickly build, train, and deploy models using common recipes. However, customers tell us they want an interface that lets them more easily customize GraphStorm’s training and inference scripts to their specific requirements. Based on customer feedback on the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML scripting APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training script, as illustrated in the following example:
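The shape of such a script, sketched with the API names used in GraphStorm's example notebooks (treat exact signatures as indicative rather than authoritative; `RgcnNCModel` stands in for a model class you define or take from the examples, and the partition file path, feature names, and hyperparameters are placeholders):

```python
import graphstorm as gs

gs.initialize()
# Load a partitioned graph (path is a placeholder for your own data)
acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

# Mini-batch dataloaders over the 'paper' node training/validation sets
train_dataloader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']),
    node_feats='feat', label_field='label',
    fanout=[20, 20], batch_size=64, train_task=True)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']),
    node_feats='feat', label_field='label',
    fanout=[20, 20], batch_size=64, train_task=False)

# RgcnNCModel: a user-defined RGCN node classification model,
# as defined in the example notebooks
model = RgcnNCModel(g=acm_data.g, num_hid_layers=2,
                    hid_size=128, num_classes=14)

evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)
trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)
trainer.setup_device(gs.utils.get_device())
trainer.fit(train_loader=train_dataloader, val_loader=val_dataloader,
            num_epochs=5)
```

The same pattern (data, dataloader, model, evaluator, trainer) applies to the other task types; the link prediction notebook swaps in the corresponding dataloader and trainer classes.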
To help you get started with the new APIs, we have also published new Jupyter notebook examples on our Documentation and tutorials page.
In-depth study of LM+GNN for large graphs with rich text features
Many enterprise applications have graphs with text features. In retail search applications, for example, purchase record data provides insights into how text-rich product descriptions, search queries, and customer behavior are related. Large language models (LLMs) alone are not well suited for modeling such data, because the underlying data distributions and relationships do not correspond to what LLMs learn from their pre-training corpora. GML, on the other hand, is great at modeling related data (graphs), but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and obtain the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.
GraphStorm 0.2 introduced built-in techniques to train language models (LMs) and GNN models efficiently and at scale on massive, text-rich graphs. Since then, customers have asked us for guidance on how GraphStorm’s LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released an LM+GNN benchmark using the large graph dataset Microsoft Academic Graph (MAG) on two standard graph ML tasks: node classification and link prediction. The graph dataset is a heterogeneous graph containing hundreds of millions of nodes and billions of edges, and most nodes are attributed with rich text features. Detailed statistics for the dataset are shown in the following table.
| Dataset | Number of nodes | Number of edges | Number of node/edge types | Number of nodes in NC training set | Number of edges in LP training set | Number of nodes with text features |
|---|---|---|---|---|---|---|
| MAG | 484,511,504 | 7,520,311,838 | 4/4 | 28,679,392 | 1,313,781,772 | 240,955,156 |
With GraphStorm, we evaluate two main LM+GNN methods: pre-trained BERT+GNN, a widely adopted baseline, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we first fine-tune the BERT models on the graph data and use the resulting fine-tuned BERT model to compute embeddings that are then used to train GNN models for prediction. GraphStorm offers different ways to fine-tune the BERT models, depending on the task type: for node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune the BERT model with the link prediction task. In the experiment, we used 8 r5.24xlarge instances for data processing and 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach performs up to 40% better (link prediction on MAG) than pre-trained BERT+GNN.
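Conceptually, both methods decouple text encoding from graph learning: BERT produces one embedding per node, and a GNN is then trained on top of those frozen vectors. The following toy numpy sketch illustrates only that second stage, with a made-up four-node graph and random stand-ins for the BERT embeddings (this is not GraphStorm code):

```python
import numpy as np

# Toy graph: 4 nodes, symmetric adjacency with self-loops (made-up data)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Stage 1 output: frozen text embeddings, one 8-dim vector per node
# (random stand-ins for BERT embeddings)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))

# Stage 2: one GCN-style layer on top of the frozen embeddings.
# Symmetric normalization D^{-1/2} A D^{-1/2}, then linear transform + ReLU.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt
W = rng.normal(size=(8, 3))              # 3 output classes
logits = np.maximum(A_hat @ H @ W, 0.0)  # aggregate neighbors, then project
print(logits.shape)  # prints (4, 3): one score per node per class
```

The difference between the two methods lies entirely in stage 1: whether `H` comes from an off-the-shelf BERT model or from one fine-tuned on the graph task before its embeddings are frozen.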
The following table shows the model performance of the two methods and the total computation time of the whole pipeline, from data processing to model training. NC stands for node classification and LP for link prediction. LM time cost denotes the time spent computing BERT embeddings for pre-trained BERT+GNN, and the time spent fine-tuning the BERT models for fine-tuned BERT+GNN.
| Dataset | Task | Data processing time | Target | Pre-trained BERT+GNN: LM time cost | Pre-trained BERT+GNN: one epoch time | Pre-trained BERT+GNN: metric | Fine-tuned BERT+GNN: LM time cost | Fine-tuned BERT+GNN: one epoch time | Fine-tuned BERT+GNN: metric |
|---|---|---|---|---|---|---|---|---|---|
| MAG | NC | 553 min | Paper topic | 206 min | 135 min | Acc: 0.572 | 1,423 min | 137 min | Acc: 0.633 |
| MAG | LP | 553 min | Citing link | 198 min | 2,195 min | MRR: 0.487 | 4,508 min | 2,172 min | MRR: 0.684 |
We also evaluate GraphStorm on large synthetic graphs to demonstrate its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time for graph preprocessing, graph partitioning, and model training. Overall, GraphStorm enables graph construction and model training on 100 billion-scale graphs in a matter of hours.
| Graph size | Data preprocessing (# instances / time) | Graph partitioning (# instances / time) | Model training (# instances / time) |
|---|---|---|---|
| 1B | 4 / 19 min | 4 / 8 min | 4 / 1.5 min |
| 10B | 8 / 31 min | 8 / 41 min | 8 / 8 min |
| 100B | 16 / 61 min | 16 / 416 min | 16 / 50 min |
More details and benchmark results are available in our KDD 2024 paper.
Conclusion
GraphStorm 0.3 is released under the Apache 2.0 license and adds native support for multi-task learning on graphs along with new APIs for customizing GraphStorm pipelines and components, helping you tackle your large-scale graph machine learning challenges. To get started, check out the GraphStorm GitHub repository and documentation.
About the Authors
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a capability of Neptune that uses graph neural networks for graphs stored in graph databases. He now leads the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He earned his PhD in computer systems and architecture from Fudan University, Shanghai, in 2014.
Jian Zhang is a Senior Applied Scientist who uses machine learning techniques to help customers solve problems such as fraud detection and decoration image generation. He has developed graph-based machine learning solutions, particularly graph neural networks, for customers in China, the US, and Singapore. As an advocate of AWS's graph capabilities, Zhang has given many public presentations on GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research, supporting science teams such as the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.