The aim of Medical Knowledge Graph Completion is to automatically predict one of the three parts (head entity, relationship, or tail entity) of RDF triples from medical data, mainly text data. Since their introduction, pretrained language models such as Word2vec, BERT, and XLNet have become a popular means of completing Medical Knowledge Graphs. Existing work focuses mainly on relationship completion and rarely addresses the completion of entities and whole triples. In this paper, a framework for predicting RDF triples for Medical Knowledge Graphs based on word embeddings (named PTMKG-WE) is proposed, specifically for the completion of entities and triples. The framework first formalizes existing samples of a given relationship from the Medical Knowledge Graph as prior knowledge. Second, it trains word embeddings on large-scale medical data, guided by this prior knowledge, through Word2vec. Third, it acquires candidate triples from the word embeddings by analogy with the existing samples. Within this framework, the paper proposes two strategies to improve the relation features. The first refines the relational semantics by clustering the existing triple samples. The second embeds the relationship more accurately by averaging over the existing samples. These two strategies can be applied separately (called PTMKG-WE-C and PTMKG-WE-M, respectively) or superimposed (called PTMKG-WE-C-M). Finally, in the current study, PubMed data and the National Drug File-Reference Terminology (NDF-RT) were collected, and a series of experiments was conducted. The experimental results show that the proposed framework and the two improvement strategies can predict new triples for Medical Knowledge Graphs when the medical data are sufficiently abundant and the Knowledge Graph provides appropriate prior knowledge.
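The analogy-based prediction step and the mean-based relation strategy can be illustrated with a minimal sketch. The toy embedding table, the entity names, and the function names below are hypothetical stand-ins (the paper trains real Word2vec vectors on PubMed text); the sketch only shows the mechanism: average the tail-minus-head offsets of known sample triples into a relation vector, then rank candidate tails by cosine similarity to head + relation vector.

```python
import numpy as np

# Toy embedding table standing in for Word2vec vectors trained on
# medical text; the words and values are illustrative only.
emb = {
    "aspirin":   np.array([1.0, 0.0, 0.2]),
    "headache":  np.array([1.0, 1.0, 0.2]),
    "ibuprofen": np.array([0.2, 0.1, 1.0]),
    "pain":      np.array([0.2, 1.1, 1.0]),
    "insulin":   np.array([0.9, 0.1, 0.9]),
    "diabetes":  np.array([0.9, 1.1, 0.9]),
}

def relation_vector(samples):
    """Mean of (tail - head) offsets over known sample triples
    (the averaging idea behind the 'M' strategy)."""
    return np.mean([emb[t] - emb[h] for h, t in samples], axis=0)

def predict_tail(head, rel_vec, exclude):
    """Nearest neighbour to head + rel_vec by cosine similarity,
    skipping the head itself and already-known tails."""
    target = emb[head] + rel_vec
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word == head or word in exclude:
            continue
        sim = float(target @ vec /
                    (np.linalg.norm(target) * np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Known samples of a "may_treat"-style relation act as prior knowledge.
samples = [("aspirin", "headache"), ("ibuprofen", "pain")]
rel = relation_vector(samples)
print(predict_tail("insulin", rel, exclude={t for _, t in samples}))
# -> diabetes
```

The clustering ('C') strategy would first group the sample triples and compute one such relation vector per cluster, so that a polysemous relation is represented by several tighter offsets instead of one global mean.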
The two strategies designed to improve the relation features significantly raise precision, and the effect is even more pronounced when they are superimposed. Another conclusion is that, under the same parameter settings, the semantic precision of the word embeddings can be improved by extending the breadth and depth of the data, which in most cases further improves the precision of the proposed prediction framework. Thus, collecting and training on big medical data is a viable way to learn more useful knowledge.