Abstract

Recent large-scale bibliometric models have largely been based on direct citation, and several recent studies have explored augmenting direct citation with other citation-based or textual characteristics. In this study we compare clustering results from direct citation, extended direct citation, a textual relatedness measure, and several citation-text hybrid measures using a set of nine million documents. Three different accuracy measures are employed, one based on references in authoritative documents, one using textual relatedness, and the last using document pairs linked by grants. We find that a hybrid relatedness measure based equally on direct citation and PubMed-related article scores gives more accurate clusters (in the aggregate) than the other relatedness measures tested. We also show that the differences in cluster contents between the different models are even larger than the differences in accuracy, suggesting that the textual and citation logics are complementary. Finally, we show that for the hybrid measure based on direct citation and related article scores, the larger clusters are more oriented toward textual relatedness, while the smaller clusters are more oriented toward citation-based relatedness.

Highlights

  • With the increasing availability of large-scale bibliographic data and the increased capacity of algorithms to cluster these data, highly detailed science maps and models are becoming ever more common

  • We find that a hybrid relatedness measure based on direct citation and PubMed-related article scores gives more accurate clusters than the other relatedness measures tested

  • We show that for the hybrid measure based on direct citation and related article scores, the larger clusters are more oriented toward textual relatedness, while the smaller clusters are more oriented toward citation-based relatedness

Introduction

With the increasing availability of large-scale bibliographic data and the increased capacity of algorithms to cluster these data, highly detailed science maps and models are becoming ever more common. Few have attempted to compare the cluster-level results of such studies, with the work by Velden, Boyack, et al. (2017) as a notable exception. Those creating models of science have always sought to establish the validity of their models in some way, but quantitative studies of accuracy are a more recent occurrence for large-scale models. The first such study using a very large literature data set was done by Boyack and colleagues using a set of 2.15 million PubMed documents (Boyack & Klavans, 2010; Boyack, Newman, et al., 2011). It compared text-based, citation-based, and hybrid relatedness measures, where titles, abstracts, and MeSH terms were obtained from PubMed, while references for each document were obtained from Scopus via matching database records. Words that occurred in at least four and not more than 500 documents were included in the calculation.
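The two ingredients described above, a document-frequency filter on the vocabulary and an equal-weight combination of citation-based and text-based relatedness, can be sketched in a few lines. This is a minimal illustration only: the toy corpus, the Jaccard text similarity (standing in for the PubMed related-article scores), and the document-frequency thresholds (scaled down from the 4–500 range used in the study) are all assumptions for the sake of the example, not the paper's actual pipeline.

```python
from collections import Counter

# Toy corpus: document id -> bag of words (in the study: titles, abstracts, MeSH terms).
docs = {
    "d1": ["citation", "clustering", "network"],
    "d2": ["citation", "network", "graph"],
    "d3": ["text", "similarity", "clustering"],
}
# Direct-citation links as unordered pairs (hypothetical data).
cites = {("d1", "d2")}

# Vocabulary filter: keep words occurring in at least MIN_DF and at most MAX_DF
# documents (the study used 4 and 500; scaled down here for the toy corpus).
MIN_DF, MAX_DF = 2, 3
df = Counter(w for words in docs.values() for w in set(words))
vocab = {w for w, c in df.items() if MIN_DF <= c <= MAX_DF}

def text_sim(a: str, b: str) -> float:
    """Jaccard similarity over the filtered vocabulary (a simple stand-in
    for the related-article scores used in the paper)."""
    sa = set(docs[a]) & vocab
    sb = set(docs[b]) & vocab
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def hybrid_sim(a: str, b: str, w: float = 0.5) -> float:
    """Hybrid relatedness: weight w on direct citation, (1 - w) on text.
    w = 0.5 mirrors the equal-weight hybrid measure discussed in the paper."""
    dc = 1.0 if (a, b) in cites or (b, a) in cites else 0.0
    return w * dc + (1 - w) * text_sim(a, b)
```

In a real implementation the resulting pairwise scores would feed a clustering algorithm such as the Leiden or smart local moving algorithm; here `hybrid_sim("d1", "d2")` simply combines a citation link with the textual overlap of the two documents.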
