Abstract

Background: Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology.

Results: We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the ongoing introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrate some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing.

Conclusions: We demonstrate the feasibility of automatically identifying resource names on a large scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/.
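The abstract reports F-measures at two granularities. As a minimal sketch of how the two can differ, the Python below scores a set of predicted annotations against a gold standard, once per mention and once per document. It assumes the balanced F-measure (F1) and exact matching, neither of which the abstract specifies. The annotation tuples, the function names `precision_recall_f` and `to_document_level`, and the example data are all hypothetical and are not taken from bioNerDS or its corpora.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (document id, start offset, end offset, resource name).
Mention = Tuple[str, int, int, str]


def precision_recall_f(gold: set, predicted: set) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-measure (F1) over two sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_document_level(mentions: Set[Mention]) -> Set[Tuple[str, str]]:
    """Collapse repeated mentions: keep one (document, resource name) pair each."""
    return {(doc_id, name) for doc_id, _start, _end, name in mentions}


# Invented toy data, for illustration only.
gold = {
    ("doc1", 10, 15, "BLAST"),
    ("doc1", 40, 45, "BLAST"),
    ("doc1", 60, 67, "GenBank"),
    ("doc2", 5, 6, "R"),
}
predicted = {
    ("doc1", 10, 15, "BLAST"),
    ("doc1", 60, 67, "GenBank"),
    ("doc2", 30, 34, "Gene"),  # a false positive
}

mention_scores = precision_recall_f(gold, predicted)
document_scores = precision_recall_f(to_document_level(gold), to_document_level(predicted))
```

In this toy case the missed second BLAST mention lowers recall at the mention level but not at the document level, because the document still counts as mentioning BLAST.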

Highlights

  • In this paper we introduce and evaluate bioNerDS, a bioinformatics named-entity recognition system for database and software names, which is used to identify mentions of such entities in the literature

  • Each document is first pre-processed using a typical text-mining pipeline consisting of tokenization, sentence splitting and part-of-speech tagging, all performed with ANNIE (A Nearly-New Information Extraction System), a plug-in for the General Architecture for Text Engineering (GATE) [12,13]
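The pre-processing steps named in the last highlight can be sketched in a few lines of Python. This is not GATE or ANNIE, which are Java tools with their own rule sets. It is a rough regex-based stand-in for sentence splitting and tokenization, with the part-of-speech step left as a pluggable function because real tagging needs a trained model. All names here (`split_sentences`, `tokenize`, `preprocess`, `placeholder_tagger`) are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Assumption: a sentence ends at . ! or ? followed by whitespace and a capital letter.
# This will wrongly split after abbreviations such as "et al. The".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Assumption: a token is a run of word characters (optionally joined by "-" or "."),
# or a single punctuation character.
TOKEN = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Split raw text into sentences using the crude boundary rule above."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


def placeholder_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for a real part-of-speech tagger: labels every token 'UNK'."""
    return ["UNK"] * len(tokens)


def preprocess(
    text: str,
    tagger: Callable[[List[str]], List[str]] = placeholder_tagger,
) -> List[List[Tuple[str, str]]]:
    """Run sentence splitting, tokenization and tagging; return (token, tag) pairs per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        result.append(list(zip(tokens, tagger(tokens))))
    return result


example_text = "We used BLAST to search GenBank. Alignments were built with ClustalW."
example_output = preprocess(example_text)
```

Passing a real tagger as the `tagger` argument would replace the placeholder tags without changing the rest of the pipeline.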

Summary

Introduction

Biology-focused databases and software define bioinformatics and their use is central to computational biology. Bioinformatics and computational biology are fields of rapid change, with a continually expanding “resourceome” [1] that includes numerous databases and software [1,2]. Such resources facilitate research in biology, and many have become “household names” (e.g., BLAST [3], ClustalW [4], etc.). As well as helping with the maintenance of resource catalogues, systematic automated processing of the literature could offer insights into the dynamics of software and data resource usage, as many resources are infrequently used [2]. This is of interest to users of these resources, who wish to know what is current and most used, and to any potential new users and resource developers.
