Background: The task of relation extraction is a crucial component in the construction of a knowledge graph. However, it often necessitates a significant amount of manual annotation, which can be time-consuming and expensive. Distant supervision, as a technique, seeks to mitigate this challenge by generating a large volume of pseudo-training data at a minimal cost, achieved by mapping triple facts onto the raw text. Objective: The aim of this study is to explore the novelty and potential of the distant supervisionbased relation extraction approach. By leveraging this innovative method, we aim to enhance knowledge reliability and facilitate new knowledge discovery, establishing associations between knowledge from specific biomedical data or existing knowledge graphs and literature. Methods: This study presents a methodology to construct a biomedical knowledge graph employing distant supervision techniques. Through establishing links between knowledge entities and relevant literature sources, we methodically extract and integrate information, thereby expanding and enriching the knowledge graph. This study identified five types of biomedical entities (e.g., diseases, symptoms and genes) and four kinds of relationships. These were linked to PubMed literature and divided into training and testing datasets. To mitigate data noise, the training set underwent preprocessing, while the testing set was manually curated. Results: In our research, we successfully associated 230,698 triples from the existing knowledge graph with relevant literature. Furthermore, we identified additional 205,148 new triples directly sourced from these studies. Conclusion: Our study markedly advances the field of biomedical knowledge graph enrichment, particularly in the context of Traditional Chinese Medicine (TCM). By validating a substantial number of triples through literature associations and uncovering over 200,000 new triples, we have made a significant stride in promoting the development of evidence-based medicine in TCM. The results underscore the potential of using a distant supervision-based relation extraction approach to both validate and expand knowledge bases, contributing to the broader progression of evidence-based practices in the realm of TCM.
Read full abstract