Abstract

The case for a names-based cyberinfrastructure for digital biology rests on the argument that scientific names serve as a standardized metadata system that has been used consistently and near universally for 250 years. As we move towards data-centric biology, name-strings can be called on to discover, index, manage, and analyze accessible digital biodiversity information from multiple sources. Known impediments to the use of scientific names as metadata include synonyms, homonyms, misspellings, and the use of other strings as identifiers. Here we compare the name-strings in GenBank, Catalogue of Life (CoL), and the Dryad Digital Repository (DRYAD) to assess how effectively the current names-management toolkit developed by Global Names achieves interoperability among distributed data sources. New tools used here include Parser (to break name-strings into component parts and to promote the use of canonical versions of the names), a modified TaxaMatch fuzzy-matcher (to help manage typographical, transliteration, and OCR errors), and Cross-Mapper (to make comparisons among data sets). The data sources include scientific names at multiple ranks; vernacular (common) names; acronyms; strain identifiers; and other surrogates, including idiosyncratic abbreviations and concatenations. About 40% of the name-strings in GenBank are scientific names representing about 400,000 species or infraspecies and their synonyms.
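The three steps named in the abstract (parsing to a canonical form, fuzzy matching, and cross-mapping) can be illustrated with a minimal Python sketch. This is not the Global Names toolkit: `canonical_form`, `similarity`, and `cross_map` are hypothetical helpers, the parser is a naive heuristic that keeps a capitalized genus plus any lowercase epithets, and `difflib.SequenceMatcher` stands in for the modified TaxaMatch algorithm. The threshold of 0.85 and the example name-strings are likewise assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

GENUS = re.compile(r"^[A-Z][a-z]+$")
EPITHET = re.compile(r"^[a-z][a-z-]+$")


def canonical_form(name_string):
    """Naive canonical form: genus plus the lowercase epithets that follow.

    Returns None when the string does not start with a genus-like token
    (for example an acronym or strain identifier).
    """
    tokens = name_string.split()
    if not tokens or not GENUS.match(tokens[0]):
        return None
    parts = [tokens[0]]
    for token in tokens[1:]:
        if not EPITHET.match(token):
            break  # authorship, year, or another annotation starts here
        parts.append(token)
    return " ".join(parts)


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in, not TaxaMatch."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cross_map(names_a, names_b, threshold=0.85):
    """Map each name-string in names_a to a name-string in names_b.

    Exact matches on canonical form are tried first, then the best
    fuzzy match at or above the (assumed) threshold.
    """
    canon_b = {}
    for raw in names_b:
        canon = canonical_form(raw)
        if canon is not None:
            canon_b.setdefault(canon, raw)

    result = {}
    for raw in names_a:
        canon = canonical_form(raw)
        if canon is None:
            result[raw] = ("unparsed", None)
        elif canon in canon_b:
            result[raw] = ("exact", canon_b[canon])
        else:
            best = max(canon_b, key=lambda c: similarity(canon, c), default=None)
            if best is not None and similarity(canon, best) >= threshold:
                result[raw] = ("fuzzy", canon_b[best])
            else:
                result[raw] = ("none", None)
    return result


if __name__ == "__main__":
    source_a = ["Homo sapiens Linnaeus, 1758", "Pomatomus saltator", "ATCC 12345"]
    source_b = ["Homo sapiens", "Pomatomus saltatrix (Linnaeus, 1766)"]
    for name, match in cross_map(source_a, source_b).items():
        print(name, "->", match)
```

A heuristic this simple will misread vernacular names that happen to look like binomials and lowercase author particles such as "von", which is part of why a dedicated parser and a purpose-built matcher are needed in practice.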

Highlights

  • The ‘big new biology’ complements traditional and reductionist approaches to biological research because it will be based on open sharing of data that will enable co-operative enterprises and large-scale projects (National Research Council of the National Academies 2009)

  • Along with phylogenetic informatics (Parr et al 2012), molecular bioinformatics, ecoinformatics (Michener and Jones 2012), and ontologies (Bard and Rhee 2004), a names-based cyberinfrastructure will make possible collaborative projects that extend across the scope and scale of biology, and create new opportunities for discovery

  • The data underpinning the analysis reported in this paper are deposited in the Dryad Data Repository at http://datadryad.org/submit?journalID=BDJ&manu=PJS_2_8080



Introduction

The ‘big new biology’ complements traditional and reductionist approaches to biological research because it will be based on open sharing of data that will enable co-operative enterprises and large-scale projects (National Research Council of the National Academies 2009). Within this emerging area, names are said to have a special role (Patterson et al 2010; Pyle 2016) because, from the time of Linnaeus, biologists have applied a convention of forming and using scientific names. Along with phylogenetic informatics (Parr et al 2012), molecular bioinformatics, ecoinformatics (Michener and Jones 2012), and ontologies (Bard and Rhee 2004), a names-based cyberinfrastructure will make possible collaborative projects that extend across the scope and scale of biology, and create new opportunities for discovery.
