Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Kevin W Boyack,Russell J Duhon,Michael Patek,Bob Schijvenaars,Katy Börner,André Skupin,Nianli Ma,David Newman,Richard Klavans,Joseph R Biberstine

doi:10.1371/journal.pone.0018029

Abstract

Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques applied to two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.

Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
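To make the pipeline steps named above more concrete, the following is a minimal sketch of three of them: tf-idf cosine similarity, top-n filtering of each document's similarities, and the Jensen-Shannon divergence that underlies the coherence measure. It is illustrative only and is not the authors' implementation. The function names, the raw-count tf and log(N/df) idf weighting, the toy documents, and the base-2 logarithm are all assumptions; the abstract does not specify how per-document divergences are aggregated into a cluster-level coherence score, so that step is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) from tokenized documents.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_similarities(vectors, n):
    """Keep only the n highest non-zero similarities for each document.

    Brute-force all-pairs comparison; fine for a toy example, not for
    millions of records.
    """
    result = {}
    for i, u in enumerate(vectors):
        sims = [(j, cosine(u, v)) for j, v in enumerate(vectors) if j != i]
        sims = [(j, s) for j, s in sims if s > 0]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = sims[:n]
    return result


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl(a, b):
        # Kullback-Leibler divergence of a from b, skipping zero-probability terms.
        return sum(pa * math.log2(pa / b[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Hypothetical toy corpus, standing in for title and abstract words.
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "expression", "tumor"],
    ["protein", "folding", "kinetics"],
]
vectors = tfidf_vectors(docs)
neighbors = top_n_similarities(vectors, n=1)
```

In this toy corpus the first two documents share two terms and become each other's only retained neighbor, while the third shares no terms with the others and keeps none. At the scale described in the abstract, an all-pairs loop like this would be impractical; an inverted index or sparse matrix products would be the more likely choice.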

Highlights

Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts
Document clustering is important for a variety of information needs and applications such as collection management, summary and analysis
The use of search engines is far more a part of our lives than is the use of clustered document sets. This is as true in the world of biomedical literature as it is for any other literature; most studies related to enhancing the results of MEDLINE searches are very similar in nature to those being done in the broader information retrieval community [11,12,13]

Summary

Introduction

Document clustering is important for a variety of information needs and applications such as collection management, summary and analysis. Despite early efforts showing that document retrieval and document clustering are highly linked topics [5,6,7], most recent work using similarity measures is focused on improving the relevancy and ranking of search results [8,9,10], with little or no reference to the important task of clustering. This focus on information retrieval is not surprising given the overwhelming increase in the number and variety of documents available over the Internet and through portals to scholarly literature such as the Web of Science, Scopus, and MEDLINE. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.
