GraphStorm is a low-code, enterprise graph machine learning (GML) framework for building, training, and deploying graph machine learning solutions on complex, enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly consider the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search and retrieval problems.
Today we released GraphStorm 0.3, which adds native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. It also adds new APIs for customizing GraphStorm pipelines: you now need only 12 lines of code to implement a custom node classification training loop. To help you get started with the new APIs, we have published two example Jupyter notebooks: one for node classification and one for link prediction. We also published a comprehensive study of co-training language models (LMs) and graph neural networks (GNNs) on large graphs with rich text features, using the Microsoft Academic Graph (MAG) dataset, in our KDD 2024 paper. The study demonstrates the performance and scalability of GraphStorm on text-rich graphs and describes best practices for configuring GML training loops for better performance and efficiency.
Native support for multi-task learning on graphs
Many enterprise applications have graph data associated with multiple tasks at different nodes and edges. For example, retail organizations want to perform fraud detection on both sellers and buyers. Scientific publishers want to find more related work to cite in their articles and need to select the right topic to make their publication discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.
GraphStorm 0.3 supports multi-task learning on graphs across the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You specify the training targets through a YAML configuration file. For example, a scientific publisher can use a YAML configuration to simultaneously define a paper topic classification task on paper nodes and a link prediction task on paper-citing-paper edges:
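A minimal sketch of such a configuration is shown below. The structure follows the pattern of GraphStorm's multi-task learning documentation, but the specific field values (label name, class count, task weights) are illustrative assumptions, not taken from a real MAG setup:

```yaml
---
version: 1.0
gsf:
  multi_task_learning:
    - node_classification:
        target_ntype: "paper"          # classify paper nodes by topic
        label_field: "label"           # assumed label column name
        num_classes: 10                # assumed number of topics
        task_weight: 1.0
    - link_prediction:
        train_etype:
          - "paper,citing,paper"       # predict paper-citing-paper edges
        num_negative_edges: 4
        task_weight: 0.5
```

Each task in the list carries its own `task_weight`, which controls that task's contribution to the combined loss in the shared training loop.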
For more details on how to run multi-task graph learning with GraphStorm, see Multi-task learning in GraphStorm in our documentation.
New APIs to customize GraphStorm components and pipelines
Since GraphStorm’s release in early 2023, customers have primarily used its command-line interface (CLI), which abstracts away the complexity of graph ML scripting so you can quickly build, train, and deploy models using common recipes. However, customers tell us they want an interface that lets them more easily customize GraphStorm’s training and inference scripts to their specific requirements. Based on customer feedback on the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML scripting APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training script, as illustrated in the following example:
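The shape of such a script, sketched with the API names used in GraphStorm's example notebooks (treat exact signatures as indicative rather than authoritative; `RgcnNCModel` stands in for a model class you define or take from the examples, and the partition file path, feature names, and hyperparameters are placeholders):

```python
import graphstorm as gs

gs.initialize()
# Load a partitioned graph (path is a placeholder for your own data)
acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

# Mini-batch dataloaders over the 'paper' node training/validation sets
train_dataloader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']),
    node_feats='feat', label_field='label',
    fanout=[20, 20], batch_size=64, train_task=True)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']),
    node_feats='feat', label_field='label',
    fanout=[20, 20], batch_size=64, train_task=False)

# RgcnNCModel: a user-defined RGCN node classification model,
# as defined in the example notebooks
model = RgcnNCModel(g=acm_data.g, num_hid_layers=2,
                    hid_size=128, num_classes=14)

evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)
trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)
trainer.setup_device(gs.utils.get_device())
trainer.fit(train_loader=train_dataloader, val_loader=val_dataloader,
            num_epochs=5)
```

The same pattern (data, dataloader, model, evaluator, trainer) applies to the other task types; the link prediction notebook swaps in the corresponding dataloader and trainer classes.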
To help you get started with the new APIs, we have also published new Jupyter notebook examples on our Documentation and tutorials page.
In-depth study of LM+GNN for large graphs with rich text features
Many enterprise applications have graphs with text features. In retail search applications, for example, purchase record data provides insights into how text-rich product descriptions, search queries, and customer behavior are related. Large language models (LLMs) alone are not well suited for modeling such data, because the underlying data distributions and relationships do not correspond to what LLMs learn from their pre-training corpora. GML, on the other hand, is great at modeling related data (graphs), but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and obtain the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.
GraphStorm 0.2 introduced built-in techniques to train language models (LMs) and GNN models efficiently and at scale on massive, text-rich graphs. Since then, customers have asked us for guidance on how GraphStorm’s LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released an LM+GNN benchmark using the large graph dataset Microsoft Academic Graph (MAG) on two standard graph ML tasks: node classification and link prediction. The graph dataset is a heterogeneous graph containing hundreds of millions of nodes and billions of edges, and most nodes are attributed with rich text features. Detailed statistics for the dataset are shown in the following table.
| Dataset | Number of nodes | Number of edges | Number of node/edge types | Number of nodes in NC training set | Number of edges in LP training set | Number of nodes with text features |
|---|---|---|---|---|---|---|
| MAG | 484,511,504 | 7,520,311,838 | 4/4 | 28,679,392 | 1,313,781,772 | 240,955,156 |
With GraphStorm, we evaluate two main LM+GNN methods: pre-trained BERT+GNN, a widely adopted baseline, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we first fine-tune the BERT models on the graph data and use the resulting fine-tuned BERT model to compute embeddings that are then used to train GNN models for prediction. GraphStorm offers different ways to fine-tune the BERT models, depending on the task type: for node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune the BERT model with the link prediction task. In the experiment, we used 8 r5.24xlarge instances for data processing and 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach performs up to 40% better (link prediction on MAG) than pre-trained BERT+GNN.
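Conceptually, both methods decouple text encoding from graph learning: BERT produces one embedding per node, and a GNN is then trained on top of those frozen vectors. The following toy numpy sketch illustrates only that second stage, with a made-up four-node graph and random stand-ins for the BERT embeddings (this is not GraphStorm code):

```python
import numpy as np

# Toy graph: 4 nodes, symmetric adjacency with self-loops (made-up data)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Stage 1 output: frozen text embeddings, one 8-dim vector per node
# (random stand-ins for BERT embeddings)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))

# Stage 2: one GCN-style layer on top of the frozen embeddings.
# Symmetric normalization D^{-1/2} A D^{-1/2}, then linear transform + ReLU.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt
W = rng.normal(size=(8, 3))              # 3 output classes
logits = np.maximum(A_hat @ H @ W, 0.0)  # aggregate neighbors, then project
print(logits.shape)  # prints (4, 3): one score per node per class
```

The difference between the two methods lies entirely in stage 1: whether `H` comes from an off-the-shelf BERT model or from one fine-tuned on the graph task before its embeddings are frozen.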
The following table shows the model performance of the two methods and the total computation time of the whole pipeline, from data processing to model training. NC stands for node classification and LP for link prediction. LM time cost denotes the time spent computing BERT embeddings for pre-trained BERT+GNN, and the time spent fine-tuning the BERT models for fine-tuned BERT+GNN.
| Dataset | Task | Data processing time | Target | Pre-trained BERT+GNN: LM time cost | Pre-trained BERT+GNN: one epoch time | Pre-trained BERT+GNN: metric | Fine-tuned BERT+GNN: LM time cost | Fine-tuned BERT+GNN: one epoch time | Fine-tuned BERT+GNN: metric |
|---|---|---|---|---|---|---|---|---|---|
| MAG | NC | 553 min | Paper topic | 206 min | 135 min | Acc: 0.572 | 1,423 min | 137 min | Acc: 0.633 |
| MAG | LP | 553 min | Citing link | 198 min | 2,195 min | MRR: 0.487 | 4,508 min | 2,172 min | MRR: 0.684 |
We also evaluate GraphStorm on large synthetic graphs to demonstrate its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time for graph preprocessing, graph partitioning, and model training. Overall, GraphStorm enables graph construction and model training on 100 billion-scale graphs in a matter of hours.
| Graph size | Data preprocessing (# instances / time) | Graph partitioning (# instances / time) | Model training (# instances / time) |
|---|---|---|---|
| 1B | 4 / 19 min | 4 / 8 min | 4 / 1.5 min |
| 10B | 8 / 31 min | 8 / 41 min | 8 / 8 min |
| 100B | 16 / 61 min | 16 / 416 min | 16 / 50 min |
More details and benchmark results are available in our KDD 2024 paper.
Conclusion
GraphStorm 0.3 is released under the Apache 2.0 license and adds native support for multi-task learning on graphs along with new APIs for customizing GraphStorm pipelines and components, helping you tackle your large-scale graph machine learning challenges. To get started, check out the GraphStorm GitHub repository and documentation.
About the Authors
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a capability of Neptune that uses graph neural networks for graphs stored in graph databases. He now leads the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He earned his PhD in computer systems and architecture from Fudan University, Shanghai, in 2014.
Jian Zhang is a Senior Applied Scientist who uses machine learning techniques to help customers solve problems such as fraud detection and decoration image generation. He has developed graph-based machine learning solutions, particularly graph neural networks, for customers in China, the US, and Singapore. As an advocate of AWS's graph capabilities, Zhang has given many public presentations on GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research, supporting science teams such as the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.