Abstract

Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.

Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: −0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).

Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.

Highlights

  • Computational methods have proven effective in the identification of highly similar and potentially unethical scientific articles

  • Full text analysis versus abstract analysis: applying a similarity ratio threshold of 0.5, we identified, among the 72,011 PubMed Central (PMC) full text citations, 150 citation pairs with both high abstract similarity and high full text similarity, 598 pairs with high abstract similarity but no full text similarity, and 282 pairs with high full text similarity but no abstract similarity

  • We evaluated the strength of association between high abstract similarity and high full text similarity in the entire PMC dataset using a log odds ratio [8] of 6.66 ± 0.13
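The pair counts in the highlights above can be turned into a quick arithmetic check. The sketch below computes the fraction of high-abstract-similarity pairs that also show high full text similarity, together with a 95% Wilson score interval. The choice of the Wilson method, the helper name `wilson_interval`, and the variable names are assumptions made for illustration; this excerpt does not state how the authors derived their interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Pair counts quoted in the highlights (similarity ratio threshold 0.5).
both_high = 150        # high abstract similarity and high full text similarity
abstract_only = 598    # high abstract similarity, no full text similarity

# Of all pairs flagged by abstract similarity, how many are also
# similar in full text?
n_high_abstract = both_high + abstract_only
fraction = both_high / n_high_abstract
low, high = wilson_interval(both_high, n_high_abstract)

print(f"fraction = {fraction:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the fraction comes out at roughly 0.20, with an interval of roughly 0.17 to 0.23, which is close to the 20.1% (95% CI [17.3%, 23.1%]) figure quoted in the abstract. The log odds ratio is not recomputed here because the number of pairs with neither kind of similarity is not given in this excerpt.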

Summary

Introduction

Computational methods have proven effective in the identification of highly similar and potentially unethical scientific articles. The text similarity-based information retrieval search engine eTBLAST [1] was tuned with the MEDLINE abstract dataset [2] to create Dejavu, a publicly available database of over 70,000 highly similar biomedical citations [3]. Because Dejavu uses only abstracts to find similar citations, it inevitably misses potential duplicate full text articles whose abstracts do not appear similar enough to warrant further investigation. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. We quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
