Abstract

Background: Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption.

Results: We collected two sets of documents, one consisting only of Open Access publications and the other consisting only of traditional journal publications. We examined them for differences in surface linguistic structures that have obvious consequences for the ease or difficulty of natural language processing and for differences in semantic content as reflected in lexical items. Regarding surface linguistic structures, we examined the incidence of conjunctions, negation, passives, and pronominal anaphora, and found that the two collections did not differ. We also examined the distribution of sentence lengths and found that both collections were characterized by the same mode. Regarding lexical items, we found that the Kullback-Leibler divergence between the two collections was low, and was lower than the divergence between either collection and a reference corpus. Where small differences did exist, log likelihood analysis showed that they were primarily in the area of formatting and in specific named entities.

Conclusion: We did not find structural or semantic differences between the Open Access and traditional journal collections.
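The two lexical comparisons named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes unigram distributions with add-alpha smoothing over a shared vocabulary, and it assumes the two-corpus G2 form of the log likelihood statistic commonly used for keyword comparison; the abstract specifies neither. The function names, the smoothing constant `alpha`, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a shared vocabulary.

    Smoothing is an assumption made here so that every word has nonzero
    probability in both collections, which keeps the divergence finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    p and q must be defined over the same keys with strictly positive values.
    """
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def log_likelihood(a, b, total_a, total_b):
    """G2 statistic for one word seen a times in corpus A and b times in corpus B.

    total_a and total_b are the token counts of the two corpora. Larger values
    mean the word's frequency differs more than expected by chance.
    """
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


# Toy stand-ins for the two collections (hypothetical data).
open_access = "the protein was not expressed in the cells".split()
traditional = "the gene was expressed and it was detected".split()

vocab = set(open_access) | set(traditional)
p = unigram_probs(open_access, vocab)
q = unigram_probs(traditional, vocab)

print("D(OA || traditional):", kl_divergence(p, q))
print("G2 for 'was':", log_likelihood(open_access.count("was"),
                                      traditional.count("was"),
                                      len(open_access), len(traditional)))
```

Kullback-Leibler divergence is asymmetric, so a comparison of this kind would normally report both directions or state which collection serves as the reference distribution.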

Highlights

  • Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications

  • For much of the modern period of biomedical natural language processing (BioNLP) research, work in text mining has focused on abstracts of journal articles

  • Shah et al. examined the location of biologically relevant words in journal articles and found that although the density of biologically relevant terms is higher in the abstract than in the body of the article, there is much more relevant information in the body of the article than in the abstract [3]


Summary

Introduction

Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. For much of the modern period of biomedical natural language processing (BioNLP) research, work in text mining has focused on abstracts of journal articles. Corney et al. (2004) provided a careful quantification of the costs of failing to work with full text, finding that more than half of the information in molecular biology papers was in the body of the text and not in the abstract [4].
