Abstract

Computational approaches to generating hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, automatically discovering novel, cross-silo biomedical hypotheses from large-scale literature repositories remains a challenge. To address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and a concept-author map on a cluster using the Map-Reduce framework. From the concept network we extract a set of heterogeneous features, such as random walk based features, neighborhood features, and common author features. Because the number of candidate links to consider in the concept network is large, the features are also extracted on a cluster with the Map-Reduce framework to address the scalability problem. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time durations. The heterogeneous features, which cover both topological and semantic features derived from the concept network, are studied with respect to their impact on the accuracy of the proposed supervised link discovery process. A case study of hypotheses generation based on the proposed method is presented in the paper.
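To make the feature families named in the abstract more concrete, the following is a minimal sketch of how neighborhood and common author features might be computed for one concept pair. The specific features shown (common neighbor count, Jaccard coefficient, shared author count), the function names, and the toy data are assumptions for illustration; the abstract does not give the paper's exact feature definitions, and random walk based features are not covered here.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (concept, concept) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_features(u, v, adj, concept_authors):
    """Illustrative features for the concept pair (u, v); not the paper's exact set."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    authors_u = concept_authors.get(u, set())
    authors_v = concept_authors.get(v, set())
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "common_authors": len(authors_u & authors_v),
    }


# Toy concept network and concept-author map (hypothetical data).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
concept_authors = {"a": {"x", "y"}, "c": {"y", "z"}}

adj = build_adjacency(edges)
features = pair_features("a", "c", adj, concept_authors)
```

In a Map-Reduce setting, a computation of this kind would be partitioned over candidate pairs; the single-machine version above only illustrates the per-pair logic.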

Highlights

  • Text mining of biomedical literature is a research area that has attracted a lot of attention in the last 5 to 10 years

  • Hypotheses generation as supervised link discovery on biomedical concept network: we model a biomedical literature repository as a concept network G, where each node represents a biomedical concept that belongs to a certain semantic type, and each edge represents a relationship between two concepts

  • Since predictions are carried out based on a classification model that is built upon a training data set extracted from two consecutive snapshots of the concept network, the performance of link discovery can be evaluated by measures such as classification accuracy, recall, and precision, obtained by n-fold cross validation on the training data
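
The evaluation measures named in the last highlight can be sketched as follows for binary link labels (1 for a link, 0 for no link). The function name and the toy labels are assumptions for illustration; in an n-fold cross validation these measures would be computed on each held-out fold and then averaged.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = link, 0 = no link)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical true labels and classifier predictions for eight concept pairs.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```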


Summary

Introduction

Text mining of biomedical literature is a research area that has attracted a lot of attention in the last 5 to 10 years. If we model a biomedical literature repository as a comprehensive network of biomedical concepts belonging to different semantic types, link discovery techniques may enable large-scale, cross-silo hypotheses discovery that goes beyond information retrieval-based discovery.

Concept network creation and feature extraction using Map-Reduce framework: we describe the implementation of the computational model presented in the "Hypotheses generation as supervised link discovery on biomedical concept network" section.

Automatic generation of class labels for concept pairs: given two snapshots G_tf and G_ts of the concept network, corresponding to two consecutive time durations tf and ts, we generate a group of labeled pairs from which a training data set can be formed for the proposed supervised link discovery.

Algorithm 2: Generating the snapshot of the concept network, G_t, for a time duration t
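
The class label generation from two consecutive snapshots can be sketched as below. This is a minimal illustration under a stated assumption, not the paper's algorithm: a concept pair that is unconnected in the first snapshot is labeled positive if it is connected in the second snapshot and negative otherwise. That labeling rule, the function name, and the toy edges are all assumptions; the summary above says only that labeled pairs are generated from the two snapshots.

```python
from itertools import combinations


def label_pairs(edges_tf, edges_ts):
    """Label concept pairs that are unconnected in the first snapshot.

    Assumed rule: label 1 if the pair is linked in the second snapshot,
    label 0 if it is still unlinked.
    """
    first = {frozenset(e) for e in edges_tf}
    second = {frozenset(e) for e in edges_ts}
    # Candidate pairs are drawn from concepts present in the first snapshot.
    nodes = sorted({n for e in first for n in e})
    labels = {}
    for u, v in combinations(nodes, 2):
        pair = frozenset((u, v))
        if pair in first:
            continue  # already linked in the first snapshot, nothing to predict
        labels[(u, v)] = 1 if pair in second else 0
    return labels


# Toy snapshots (hypothetical): the later snapshot gains the edge (a, c).
edges_tf = [("a", "b"), ("b", "c"), ("c", "d")]
edges_ts = edges_tf + [("a", "c")]
labels = label_pairs(edges_tf, edges_ts)
```

Enumerating all unconnected pairs is quadratic in the number of concepts, which is consistent with the scalability concern the abstract raises; a practical version would likely restrict or sample the candidate pairs.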

