Link prediction is a key technique in graph reasoning for inferring missing links between entities. It leverages known graph-structure information to predict missing facts. Previous studies have focused either on the semantic representation of individual triples or on the graph structure built from triples. The former ignores the associations between different triples, while the latter ignores the intrinsic semantics of the nodes themselves. Furthermore, common graph-structured datasets are inherently incomplete and suffer from missing information. To address these challenges, we present a novel model, Multi-source Information Graph Embedding with Ensemble Learning for Link Prediction (EMGE), which effectively improves link-prediction reasoning. Ensemble learning is applied systematically throughout model training. At the data level, entity embeddings are enhanced by integrating structured graph information and unstructured textual data as multi-source inputs, and an attention mechanism is introduced to fuse these inputs. During training, the principle of ensemble learning is employed to extract semantic features from multiple neural network models, facilitating the interaction of the enriched information. To ensure effective model learning, a novel loss function based on contrastive learning is devised to minimize the discrepancy between predicted values and the ground truth. Moreover, to enrich the semantic representation of graph nodes in link prediction, two rules are introduced during the aggregation of graph-structure information. These rules incorporate the concept of spreading activation, enabling a more comprehensive understanding of the relationships between nodes and edges in the graph. During the testing phase, the EMGE model is validated on three datasets: WN18RR, FB15k-237, and a private Chinese financial dataset. Compared to the baseline model, the experimental results show that EMGE reduces the mean rank (MR) by 0.2 times, improves the mean reciprocal rank (MRR) by 5.9%, and increases Hit@1 by 12.9%.
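To make the multi-source fusion step concrete, below is a minimal PyTorch sketch of attention-weighted fusion of structural and textual entity embeddings. The module name, the scalar-scoring attention, and the embedding dimensions are illustrative assumptions; the abstract does not specify EMGE's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceFusion(nn.Module):
    """Fuse structural and textual entity embeddings with learned attention.

    Hypothetical sketch: the abstract only states that an attention
    mechanism fuses graph-structure and text embeddings; the exact
    architecture of EMGE is not given.
    """
    def __init__(self, dim: int):
        super().__init__()
        # One scalar attention score per source, computed from the embedding.
        self.score = nn.Linear(dim, 1)

    def forward(self, struct_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Stack the two sources: (batch, num_sources=2, dim)
        sources = torch.stack([struct_emb, text_emb], dim=1)
        # Normalized attention weights over the sources: (batch, 2, 1)
        weights = F.softmax(self.score(sources), dim=1)
        # Attention-weighted sum -> fused entity embedding: (batch, dim)
        return (weights * sources).sum(dim=1)

# Usage: fuse 64-dim structural and textual embeddings for a batch of 8 entities.
fusion = MultiSourceFusion(dim=64)
fused = fusion(torch.randn(8, 64), torch.randn(8, 64))
print(fused.shape)  # torch.Size([8, 64])
```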
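Similarly, the abstract names a contrastive loss but not its exact form. The following sketch uses a generic InfoNCE-style objective with in-batch negatives as a hypothetical stand-in: each (head, relation) query should score its true tail higher than the other tails in the batch.

```python
import torch
import torch.nn.functional as F

def contrastive_link_loss(head_rel: torch.Tensor,
                          tails: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss for link prediction.

    Hypothetical stand-in for EMGE's loss, which the abstract does not
    spell out. head_rel holds (head, relation) query embeddings; tails
    holds the matching true-tail embeddings, row-aligned with head_rel.
    """
    # Cosine-similarity logits between every query and every candidate tail.
    logits = F.normalize(head_rel, dim=-1) @ F.normalize(tails, dim=-1).T
    # The i-th query's positive is the i-th tail; other rows act as negatives.
    targets = torch.arange(head_rel.size(0))
    return F.cross_entropy(logits / temperature, targets)

# Usage with random 64-dim embeddings for a batch of 16 triples.
loss = contrastive_link_loss(torch.randn(16, 64), torch.randn(16, 64))
print(loss.item())
```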