Discovering themes in biomedical literature using a projection-based algorithm

Lana Yeganova, Sun Kim, Grigory Balasanov, W John Wilbur

doi:10.1186/s12859-018-2240-0


Abstract


Background

The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously.

Results

We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed® documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method.

Conclusions

This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.

Highlights

The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available.
We extend the projection algorithm to the theme algorithm, which detects a theme iteratively: it alternates between computing the consensus vector and refining the document set until the theme becomes stable.
We describe the connection between the projection algorithm and the power iteration method, and provide conditions that guarantee the convergence of the projection algorithm to the first singular vector of the data matrix.
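
The power iteration method mentioned in the highlights can be illustrated with a short sketch. This is not the authors' projection algorithm, whose details are not given on this page; it is the textbook power iteration applied to A^T A, which converges to the first right singular vector of A when the largest singular value is strictly greater than the second and the start vector is not orthogonal to that singular vector. The function name `first_singular_vector` and its parameters `iters` and `tol` are hypothetical names chosen for this illustration.

```python
import math


def first_singular_vector(rows, iters=200, tol=1e-12):
    """Textbook power iteration on A^T A (illustrative sketch only).

    `rows` is a dense matrix A given as a list of equal-length rows,
    e.g. one row per document and one column per term.
    Returns (v, sigma): an approximate first right singular vector
    and the matching singular value estimate ||A v||.
    """
    n_cols = len(rows[0])
    # Uniform positive start vector with unit length.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        # u = A v
        u = [sum(a * x for a, x in zip(row, v)) for row in rows]
        # w = A^T u
        w = [sum(row[j] * u[i] for i, row in enumerate(rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the row space")
        w = [x / norm for x in w]
        # Stop once the vector no longer changes between iterations.
        delta = max(abs(a - b) for a, b in zip(v, w))
        v = w
        if delta < tol:
            break
    # Singular value estimate: the length of A v for unit-length v.
    sigma = math.sqrt(sum(sum(a * x for a, x in zip(row, v)) ** 2
                          for row in rows))
    return v, sigma


# Toy matrix with singular values 3 and 1.
v, sigma = first_singular_vector([[3.0, 0.0], [0.0, 1.0]])
print(v, sigma)
```

For this toy matrix the returned vector approaches [1, 0] and the singular value estimate approaches 3. A collection of tens of thousands of documents would call for a sparse matrix representation rather than the dense lists used here.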



Journal: BMC Bioinformatics
Publication Date: Jul 16, 2018
Citations: 1
License type: open-access


