Abstract

Background: The quantity of documents being published requires researchers to specialize in a narrower field, meaning that inferable connections between publications (particularly from different domains) can be missed. This has given rise to automatic literature based discovery (LBD). However, unless heavily filtered, LBD generates more potential new knowledge than can be manually verified, and another form of selection is required before the results can be passed on to a user. Since a large proportion of the automatically generated hidden knowledge is valid but generally known, we investigate the hypothesis that non-trivial, interesting, hidden knowledge can be treated as an anomaly and identified using anomaly detection approaches.

Results: Two experiments are conducted: (1) to avoid errors arising from incorrect extraction of relations, the hypothesis is validated using manually annotated relations appearing in a thesaurus, and (2) automatically extracted relations are used to investigate the hypothesis on publication abstracts. These allow an investigation of a potential upper bound and the detection of limitations yielded by automatic relation extraction.

Conclusion: We apply one-class SVM and isolation forest anomaly detection algorithms to a set of hidden connections to rank connections by identifying outlying (interesting) ones, and show that the approach increases the F1 measure by a factor of 10 while greatly reducing the quantity of hidden knowledge to manually verify. We also demonstrate the statistical significance of this result.

Highlights

  • The quantity of documents being published requires researchers to specialize to a narrower field, meaning that inferable connections between publications can be missed

  • We suggest re-ranking based on an anomaly detection algorithm, as this approach is highly suitable for datasets with very small numbers of outliers

  • Three separate cutoff dates are required for these experiments: the anomaly detection model is built from hidden knowledge generated from information up to date1, with the gold standard annotated from information up to date2


Summary

Introduction

The quantity of documents being published requires researchers to specialize in a narrower field, meaning that inferable connections between publications (particularly from different domains) can be missed. Literature based discovery (LBD) attempts to recover such connections automatically. For example, in the biomedical domain, Swanson [1] found one publication mentioning Raynaud disease as affecting blood viscosity, platelet aggregation, and vascular reactivity, and another stating that fish oil has the opposite effect on the same properties, but the connection between Raynaud disease and fish oil had not been noticed. This forms the outline of the A-B-C model [1], which extracts all pairs of A and B that are known to be related (such as Raynaud disease - blood viscosity) and matches over B terms to find connections A - B - C where A - B appear in one publication and B - C in another but no single publication connects A directly to C. The order in which the resulting hidden connections are presented can be determined by the number of linking (B) terms (LTs, e.g. [6]), computed confidence values (e.g. [7]), or by assigning weights and rankings to the LTs based on medical subject headings (e.g. [4]).
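The A-B-C matching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`find_hidden_connections`, `rank_by_linking_terms`), the input layout (a dictionary from a publication identifier to its related term pairs), and the treatment of relations as symmetric are all assumptions made for illustration. The toy data mirrors the Raynaud disease and fish oil example, and candidates are ordered by the number of linking terms, one of the ordering options mentioned in the text.

```python
from collections import defaultdict
from itertools import combinations


def find_hidden_connections(relations_by_doc):
    """Return {(A, C): set of linking B terms} for an A-B-C style search.

    relations_by_doc is a hypothetical input layout: publication id -> list of
    (term, term) pairs. Relations are assumed to be symmetric.
    """
    neighbours = defaultdict(set)  # term -> terms it is directly related to
    sources = defaultdict(set)     # unordered term pair -> publications stating it
    for doc_id, pairs in relations_by_doc.items():
        for x, y in pairs:
            if x == y:
                continue  # ignore degenerate self-relations
            neighbours[x].add(y)
            neighbours[y].add(x)
            sources[frozenset((x, y))].add(doc_id)

    hidden = defaultdict(set)
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if c in neighbours[a]:
                continue  # some publication already connects A directly to C
            docs = sources[frozenset((a, b))] | sources[frozenset((b, c))]
            if len(docs) >= 2:  # A-B and B-C can be taken from different publications
                hidden[(a, c)].add(b)
    return dict(hidden)


def rank_by_linking_terms(hidden):
    """Order candidate A-C pairs by descending number of linking terms."""
    return sorted(hidden.items(), key=lambda item: (-len(item[1]), item[0]))


# Toy data mirroring the example in the text (publication ids are made up).
publications = {
    "pub1": [
        ("Raynaud disease", "blood viscosity"),
        ("Raynaud disease", "platelet aggregation"),
        ("Raynaud disease", "vascular reactivity"),
    ],
    "pub2": [
        ("fish oil", "blood viscosity"),
        ("fish oil", "platelet aggregation"),
        ("fish oil", "vascular reactivity"),
    ],
}

hidden = find_hidden_connections(publications)
ranked = rank_by_linking_terms(hidden)
for (a, c), linking_terms in ranked:
    print(a, "-", c, "via", sorted(linking_terms))
```

On this toy input the only candidate is the Raynaud disease - fish oil pair, supported by three linking terms. Pairs of B terms that share a single publication (such as blood viscosity and platelet aggregation, both linked through Raynaud disease in one document) are not reported, because both relations come from the same source.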
