Abstract

Background

The complexity and scale of knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases. To this end, facts must be combined in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses the problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations, that potentially characterize a biomedical relation are mined using distant supervision.

Results

It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For the experimental evaluation, two frequent biomedical relations, namely “has target” and “may treat”, are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. This is a substantial improvement over random classification and shows that good predictions can be prioritized by following the suggested approach.

Conclusions

Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and the discovery of indirect connections between biomedical concepts.

Electronic supplementary material

The online version of this article (doi:10.1186/s13326-015-0021-5) contains supplementary material, which is available to authorized users.

Highlights

  • The complexity and scale of knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases

  • Indirect knowledge connecting two concepts cs and ct is defined as a sequence of triples starting with concept cs and ending in concept ct, where the second concept of each triple must be equal to the first concept of its following triple

  • The pruned, unstructured part of the knowledge graph contains 84,635 vertices and around 39 million edges with 104,953 different labels between around 9 million connected concept pairs. Another 2.8 million pairs for relations stemming from the Unified Medical Language System (UMLS) and DrugBank were added to the graph as edges, but no new concepts were introduced, because the graph would have grown too large if all concepts of the UMLS had been included as length m = 4
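
The definition of indirect knowledge in the highlights above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `is_indirect_path` and `path_pattern`, the `(subject, relation, object)` tuple layout, and the example triples are all assumptions made for illustration. The sketch checks that a sequence of triples starts at cs, ends at ct, and chains correctly, and then extracts the path pattern, i.e. the sequence of relation labels.

```python
from typing import List, Tuple

# Hypothetical layout: (subject concept, relation label, object concept).
Triple = Tuple[str, str, str]


def is_indirect_path(path: List[Triple], cs: str, ct: str) -> bool:
    """Return True if `path` is a chain of triples leading from cs to ct."""
    if not path:
        return False
    # The first triple must start at cs and the last one must end at ct.
    if path[0][0] != cs or path[-1][2] != ct:
        return False
    # The second concept of each triple must equal the first concept
    # of the triple that follows it.
    return all(prev[2] == nxt[0] for prev, nxt in zip(path, path[1:]))


def path_pattern(path: List[Triple]) -> Tuple[str, ...]:
    """Return the sequence of relation labels along the path."""
    return tuple(relation for _, relation, _ in path)


# Made-up triples for illustration only; they are not taken from the paper's graph.
example_path = [
    ("drug_A", "inhibits", "protein_B"),
    ("protein_B", "associated_with", "disease_C"),
]

print(is_indirect_path(example_path, "drug_A", "disease_C"))
print(path_pattern(example_path))
```

Treating the pattern as a tuple of labels keeps it hashable, so patterns could be counted or used as feature keys in a classifier; whether the original work encodes them this way is not stated in this excerpt.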


Summary

Introduction

The complexity and scale of knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases. To this end, facts must be combined in order to formulate hypotheses or draw conclusions about the domain concepts.

Motivation and objectives

Knowledge discovery is an important field of research, especially in the biomedical domain, where the scale and growth of accumulated knowledge of all kinds are already beyond the capabilities of a single human to keep up with. This has motivated research towards mining knowledge from heterogeneous data in both structured and unstructured knowledge bases (KBs).

