The goal of dynamic link property prediction is to predict the property (often the existence) of a link between a pair of nodes at a future timestamp.
Negative edge sampling. In real applications, the true edges are not known in advance. Instead, a large number of node pairs are queried, and only the pairs with the highest scores are treated as edges. Motivated by this, we frame the link prediction task as a ranking problem and sample multiple negative edges for each positive edge. In particular, for a given positive edge (s, d, t), we fix the source node s and timestamp t and sample q different destination nodes d'. For each dataset, q is chosen to balance evaluation completeness against test-set inference time. Of the q negative samples, half are sampled uniformly at random, while the other half are historical negative edges (edges that were observed in the training set but are not present at time t).
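A minimal sketch of this sampling scheme is given below. It assumes `train_dsts` maps each source node to the destinations it was linked to in the training set, and `positive_dsts` is the set of destinations actually linked to the source at the query timestamp; these names and the function signature are ours for illustration, not TGB's API.

```python
import numpy as np

def sample_negatives(src, positive_dsts, q, num_nodes, train_dsts, rng):
    """Draw q negative destinations for a positive edge with source `src`:
    half uniformly at random, half 'historical' (seen with src in the
    training set but not linked to src at the query timestamp)."""
    # Uniform negatives: any node id in [0, num_nodes) is a candidate.
    uniform = rng.integers(0, num_nodes, size=q // 2)

    # Historical negatives: past destinations of src that are not positives now.
    hist_pool = [d for d in train_dsts.get(src, ()) if d not in positive_dsts]
    n_hist = q - q // 2
    if hist_pool:
        historical = rng.choice(hist_pool, size=n_hist, replace=True)
    else:
        # Fall back to uniform sampling when src has no usable history.
        historical = rng.integers(0, num_nodes, size=n_hist)

    return np.concatenate([uniform, historical])
```

For example, `sample_negatives(5, {7}, 20, 1000, {5: [7, 9, 13]}, np.random.default_rng(0))` returns 10 uniform negatives plus 10 negatives drawn from the historical pool {9, 13}.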
Performance metrics. We use the filtered mean reciprocal rank (MRR) as the metric for this task, since it is designed for ranking problems. The MRR computes the reciprocal rank of the true destination node among the negative (fake) destinations and is commonly used in the recommender system and knowledge graph literature.
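The following sketch shows how MRR can be computed from model scores, assuming the "filtered" property is ensured upstream by never sampling a true edge as a negative; the array shapes and tie-breaking convention are our assumptions.

```python
import numpy as np

def mrr(pos_scores, neg_scores):
    """Mean reciprocal rank: for each positive edge, rank the score of the
    true destination against its q sampled negatives, then average the
    reciprocal ranks. pos_scores has shape (n,), neg_scores (n, q)."""
    # Rank of each positive among its negatives (1 = best). Negatives scoring
    # at least as high as the positive count as ranked above it (pessimistic ties).
    ranks = 1 + (neg_scores >= pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())
```

A perfect model, whose positive always outscores all q negatives, attains an MRR of 1.0; ranking the positive second on every query yields 0.5.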
Results on small datasets. On the small tgbl-wiki and tgbl-review datasets, we observe that the best-performing models are quite different. Furthermore, the models with the best performance on tgbl-wiki, such as CAWN and NAT, show a significant drop in performance on tgbl-review. A possible explanation is that tgbl-review has a much higher surprise rate than tgbl-wiki. A high surprise rate indicates that a large proportion of test edges are never observed in the training set, so tgbl-review requires more inductive reasoning. On tgbl-review, GraphMixer and TGAT are the best-performing models. Due to their smaller size, we can sample all possible negatives for tgbl-wiki and one hundred negatives per positive edge for tgbl-review.
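The surprise rate referred to above can be made concrete with a short sketch that follows the description in the text (the fraction of test edges never seen during training); the exact TGB implementation may differ.

```python
def surprise_rate(train_edges, test_edges):
    """Fraction of test edges, viewed as (source, destination) pairs, that
    never appear in the training set. A high value means the test set
    demands more inductive reasoning. Edges are (src, dst, t) triples."""
    train_pairs = {(s, d) for s, d, _ in train_edges}
    test_pairs = [(s, d) for s, d, _ in test_edges]
    unseen = sum(pair not in train_pairs for pair in test_pairs)
    return unseen / len(test_pairs)
```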
Results on large datasets. Most methods run out of GPU memory on these datasets, so we compare TGN, DyRep, and EdgeBank, which have lower GPU memory requirements. Note that some datasets, such as tgbl-comment and tgbl-flight, span several years, which could result in distribution shifts over their long time spans.
Perspectives. As seen above on tgbl-wiki, the number of negative samples used for evaluation can significantly affect model performance: we see a significant drop in performance for most methods when the number of negatives increases from 20 to all possible destinations. This confirms that more negative samples are indeed required for a robust evaluation. Interestingly, methods such as CAWN and EdgeBank show a comparatively smaller drop in performance, and we leave investigating why certain methods are less affected as future work.
Next, we see a difference of up to two orders of magnitude in the training and validation time of the TG methods, with the EdgeBank heuristic baseline always being the fastest (since it is simply implemented as a hash table). This shows that improving model efficiency and scalability is an important future direction, so that new and existing models can be tested on the large datasets provided in TGB.
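To illustrate why EdgeBank is so fast, here is a minimal sketch of the heuristic as described above: memorize every observed pair in a hash set and predict that a queried pair is an edge iff it has been seen before. This is the "unlimited memory" flavor; TGB also considers variants (e.g., with a fixed time window), and its implementation may differ in detail.

```python
class EdgeBank:
    """Memorization baseline: no parameters, no training step."""

    def __init__(self):
        self.memory = set()  # hash set of observed (source, destination) pairs

    def update(self, src, dst):
        # Record a newly observed edge; O(1) expected time.
        self.memory.add((src, dst))

    def predict(self, src, dst):
        # Score 1.0 if the pair was seen before, else 0.0.
        return 1.0 if (src, dst) in self.memory else 0.0
```

Because both update and prediction are constant-time set operations, EdgeBank's runtime is dominated by a single pass over the edge stream, which explains the orders-of-magnitude gap against learned TG methods.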