Abstract

Background: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names.

Results: An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm (“Solr-Plant”) was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using the S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant.

Conclusion: Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state of the art in terms of both efficiency and accuracy in identifying names restricted to the species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect the potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names.
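
To illustrate the kind of alignment-based matching named in the Results, the following is a minimal Python sketch of Smith-Waterman local alignment scoring applied to a misspelled species name. It is not the Solr-Plant implementation: the scoring values (match = 2, mismatch = -1, gap = -1), the function names, and the three-name candidate list are assumptions chosen for illustration, and the Solr candidate retrieval step is replaced here by a plain scan over a hypothetical list.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between two strings.

    The scoring values are illustrative assumptions, not the
    parameters used by Solr-Plant.
    """
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def resolve_name(query: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest alignment score to the query."""
    return max(candidates, key=lambda name: smith_waterman_score(query, name))


if __name__ == "__main__":
    # Hypothetical candidate list standing in for a name thesaurus.
    thesaurus = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
    print(resolve_name("Quercus albba", thesaurus))
```

In a system of the kind described, a scorer like this would presumably re-rank a short list of candidates returned by the search index rather than scan the whole thesaurus, which keeps the quadratic alignment cost confined to a few comparisons per query.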

Highlights

  • A comparative report of the plant name mappings on the two evaluation datasets is provided in Tables 1 and 2

  • The performance of Solr-Plant was slightly lower than that of the Taxonomic Name Resolution Service (TNRS) (F-score: 0.95 versus 0.98); however, the approach performed better at normalizing misspelled names, as shown on the National Center for Biotechnology Information (NCBI) misspelling dataset

Summary

Introduction

The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Plant-related information is embedded across biodiversity and biomedical data sources. A requirement for such tasks is the ability to resolve names used in data sources to accepted taxonomic names. This is an essential step towards supporting the linking of knowledge across biodiversity data sources, acknowledging the species-centric nature of the discipline [18]. This reconciliation must include mapping of name variants to taxonomic concepts
