Abstract

Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full-text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Text mining finds many thousands of new literature-database relationships, since these relationships are also absent from the set of articles cited by database records. We recommend that structured annotation of database records in articles be extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high throughput of this text-mining pipeline make this activity possible both accurately and at low cost, which will allow the development of new integrated data services.

Highlights

  • Linking the scientific literature to databases is a goal pursued by many sectors of the scientific community, driven by the need to enable scientists to navigate and analyse research information in a timely and comprehensive manner. In the field of biological sciences, there is a long tradition of partnership between journals and public databases, ensuring that data are archived and available for reuse in the long term.

  • The performance of the Whatizit ANA pipeline may be somewhat overestimated, since we evaluated false positive annotations only, and some accession numbers are likely to have been missed by both the publishers and the pipeline (doi:10.1371/journal.pone.0063184.t002).

  • We manually analysed the false positive annotations produced by our pipeline by reading the context of each database citation, given that the structured accession numbers provided in articles might not always be complete or correct.
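
The evaluation described in the highlights above can be illustrated with a minimal sketch. It is not the Whatizit ANA pipeline: the accession patterns are simplified assumptions, and the context cue lists and the function names `find_citations` and `precision` are hypothetical. The sketch accepts a candidate accession number only when a database name appears nearby, and computes precision from manually reviewed counts. Because only false positives are reviewed, recall cannot be derived this way: accession numbers missed by both publishers and the pipeline never enter the counts.

```python
import re

# Simplified accession patterns (assumptions for illustration only).
PATTERNS = {
    "pdb": re.compile(r"\b[1-9][A-Za-z0-9]{3}\b"),
    "uniprot": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
}

# Hypothetical context cues: a match counts only if one of these is nearby.
CUES = {
    "pdb": ("pdb", "protein data bank"),
    "uniprot": ("uniprot", "swiss-prot"),
}


def find_citations(text, window=40):
    """Return (database, accession) pairs whose surrounding text names the database."""
    hits = []
    for db, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            context = text[start:match.end() + window].lower()
            if any(cue in context for cue in CUES[db]):
                hits.append((db, match.group()))
    return hits


def precision(true_positives, false_positives):
    """Precision from manually reviewed annotations; 0.0 if nothing was annotated."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

The context check is what keeps a year such as "2010", which fits the four-character pattern, from being reported when no database is named nearby. A year that sits close to a genuine database mention would still slip through, which is the kind of false positive that manual review of the citation context is meant to catch.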


Summary

Introduction

In the field of biological sciences, there is a long tradition of partnership between journals and public databases, ensuring that data are archived and available for reuse in the long term. This began with an agreement between the EMBL Data Library and the journal Nucleic Acids Research (NAR) in 1988 [1]. Kahn and Hazledine outlined the new NAR policy, which was that any article that discussed or contained sequence data was required to show evidence that the sequence data had been deposited in the EMBL Data Library, i.e. the data had to be cited using an accession number. This approach to data management was subsequently adopted widely by biological science journals and has since been applied to other types of data, including protein structures and gene expression experiments, becoming standard practice in many areas of research. The publication of data and evidence of its reuse (i.e. citation) by the community could be a valid additional measure of the impact of a piece of research.

