Abstract

Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare linguistic features of bioRxiv preprints with those of published biomedical text as a whole, as this offers an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint-peer-reviewed article pairs, and identify journals that publish papers linguistically similar to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify the journals and articles that are most linguistically similar to a bioRxiv or medRxiv preprint, as well as to observe where the preprint would be positioned within a published article landscape.
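As a rough illustration of the embedding idea described above, the following minimal sketch builds a document embedding by averaging word vectors and then ranks candidate articles by distance to a preprint. The averaging rule, the Euclidean distance, the toy three-dimensional vectors, and all names (`WORD_VECTORS`, `embed_document`, `rank_by_similarity`) are assumptions made for illustration; the abstract does not specify how the embeddings were pooled or compared.

```python
import math

# Hypothetical toy word vectors. A real run would load vectors from a
# word2vec model trained on preprint text instead of this hand-made table.
WORD_VECTORS = {
    "gene": [0.9, 0.1, 0.0],
    "expression": [0.8, 0.2, 0.1],
    "protein": [0.7, 0.3, 0.0],
    "neural": [0.1, 0.9, 0.2],
    "network": [0.0, 0.8, 0.3],
    "training": [0.1, 0.7, 0.4],
}


def embed_document(tokens, word_vectors):
    """Average the vectors of in-vocabulary tokens (assumed pooling rule)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("document has no in-vocabulary tokens")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query_vec, candidates):
    """Return candidate names sorted from nearest to farthest."""
    return sorted(candidates, key=lambda name: euclidean(query_vec, candidates[name]))


# Out-of-vocabulary tokens ("unknownword") are simply skipped.
preprint = embed_document(["gene", "expression", "unknownword"], WORD_VECTORS)
published = {
    "genomics_paper": embed_document(["gene", "protein"], WORD_VECTORS),
    "ml_paper": embed_document(["neural", "network", "training"], WORD_VECTORS),
}
ranking = rank_by_similarity(preprint, published)
print(ranking)
```

In this toy setup the genomics-style document ranks closest to the preprint. The same nearest-neighbor pattern could in principle be applied to journal-level centroids, although that extension is likewise an assumption here.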

Highlights

  • The dissemination of research findings is key to science

  • Text analytics is generally comparative in nature, so we selected 3 relevant text corpora for analysis: the bioRxiv corpus, which is the target of the investigation; the PubMed Central Open Access (PMCOA) corpus, which represents the peer-reviewed biomedical literature; and the New York Times Annotated Corpus (NYTAC), which is used as a representative of general English text

  • Over 77% of bioRxiv preprints with a corresponding publication in our snapshot were successfully detected within the PMCOA corpus

Summary

Introduction

The dissemination of research findings is key to science. Much of this communication happened orally [1].

Examining linguistic shifts during publication licensed under the BSD 3-Clause and Creative Commons Public Domain Dedication Licenses at https://github.com/greenelab/annorxiver. The preprint similarity search website can be found at https://greenelab.github.io/preprint-similarity-search/, and code for the website is available under a BSD-2-Clause Plus Patent License at https://github.com/greenelab/preprint-similarity-search. All corresponding data for every figure in this manuscript is available at https://github.com/greenelab/annorxiver/blob/master/FIGURE_DATA_SOURCE.md. Full text access for the bioRxiv repository is available at https://www.biorxiv.org/tdm. Access to PubMed Central’s Open Access subset is available on NCBI’s FTP server at https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/.

