Solr-Plant: efficient extraction of plant names from text

Vivekanand Sharma, Maria Isabel Restrepo, Indra Neil Sarkar

doi:10.1186/s12859-019-2874-6

Journal: BMC Bioinformatics
Publication Date: May 22, 2019
Citations: 5
License type: open-access

Affiliation: Providence College, Brown University

Abstract

Background: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names.

Results: An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm (“Solr-Plant”) was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using the S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant.

Conclusion: Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state of the art in terms of both efficiency and accuracy in the identification of names restricted to the species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect the potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names.
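To make the alignment step concrete, the following is a minimal sketch of Smith-Waterman local alignment used to rank candidate names against a misspelled query. It is not the Solr-Plant implementation: the scoring values (match 2, mismatch -1, gap -1), the function names, and the hard-coded candidate list are all assumptions for illustration. The candidate list simply stands in for whatever a fuzzy index lookup might return.

```python
from typing import List, Tuple


def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0,
                          prev[j - 1] + step,  # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def rank_candidates(query: str, candidates: List[str]) -> List[Tuple[int, str]]:
    """Score each candidate against the query, highest score first."""
    q = query.lower()
    scored = [(smith_waterman(q, c.lower()), c) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[0])


# Hypothetical candidates; a real system would obtain these from an index.
candidates = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
ranking = rank_candidates("Quercus albba", candidates)
print(ranking[0][1])
```

With these assumed scores, the misspelled query "Quercus albba" ranks "Quercus alba" first, because a single gap costs less than the matches lost against the other candidates.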

Highlights

The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources
A comparative report of the plant name mappings on two evaluation datasets is provided in Tables 1 and 2
The performance of Solr-Plant was slightly lower than that of the Taxonomic Name Resolution Service (TNRS) (F-score: 0.95 versus 0.98); however, the approach itself showed better performance on normalizing misspelled names, as shown on the National Center for Biotechnology Information (NCBI) misspelling dataset
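For readers comparing the reported F-scores, the short sketch below shows how an F-score is commonly computed from precision and recall. It assumes the balanced F1 variant (the harmonic mean); the text above does not state which variant was used, and the input values are illustrative, not figures from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only, not results from the paper.
print(f_score(0.5, 0.5))
```

Because the harmonic mean is pulled toward the lower of the two inputs, a system needs both high precision and high recall to reach a high F-score.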

Summary

Introduction

The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Plant-related information is embedded across biodiversity and biomedical data sources. Retrieving it requires the ability to resolve the names used in data sources to accepted taxonomic names. This is an essential step towards supporting the linking of knowledge across biodiversity data sources, acknowledging the species-centric nature of the discipline [18]. This reconciliation must include mapping of name variants to taxonomic concepts.

Objectives

Results

Discussion

Conclusion