Semi-automatic Extraction of Plants Morphological Characters from Taxonomic Descriptions Written in Spanish.

Maria Mora, José Araya

doi:10.3897/bdj.6.e21282

Open Access

Abstract

Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available in text format in scientific publications. The volume of publications is very large, so processing it to obtain highly structured text would be complex and very expensive. Approaches such as citizen science may help by selecting whole fragments of text dealing with morphological descriptions, but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles).

It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation.

This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3) and Trees of Costa Rica Volume IV (TCRv4), and to a subset of descriptions from the Manual of Plants of Costa Rica (MPCR), with very competitive results (average performance above 92.5%).

The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows structures, characters, and relations between characters and structures to be documented. Each extracted object is associated with attributes such as name, value, modifiers, restrictions and ontology term id, amongst others. The implemented tool is free software. It was developed in Java and integrates existing technology such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.
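
As a rough illustration of the kind of output the abstract describes, the sketch below turns one tabular row into an XML record containing structures, characters and a relation. It is written in Python for brevity, although the tool itself was developed in Java. The element and attribute names (`description`, `structure`, `character`, `relation`, `ontology_id`), the sample row and the placeholder ontology id are all assumptions made for this example; they are not the tool's actual XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical input row. The field names and the "PO:0000000" id are
# placeholders for illustration only, not values from the real tool.
SAMPLE_ROW = {
    "taxon": "Example taxon",
    "structures": [
        {
            "id": "s1",
            "name": "hoja",
            "ontology_id": "PO:0000000",
            "characters": [
                {"name": "arrangement", "value": "alterna"},
                {"name": "length", "value": "5-10 cm", "modifier": "generalmente"},
            ],
        },
        {"id": "s2", "name": "pecíolo", "ontology_id": None, "characters": []},
    ],
    "relations": [{"name": "part_of", "from": "s2", "to": "s1"}],
}


def build_description_xml(row):
    """Build one <description> element from a row (assumed schema)."""
    root = ET.Element("description", {"taxon": row["taxon"]})
    for structure in row["structures"]:
        attrs = {"id": structure["id"], "name": structure["name"]}
        # Only emit the ontology id when one was found for the term.
        if structure.get("ontology_id"):
            attrs["ontology_id"] = structure["ontology_id"]
        structure_el = ET.SubElement(root, "structure", attrs)
        for character in structure["characters"]:
            char_attrs = {"name": character["name"], "value": character["value"]}
            if character.get("modifier"):
                char_attrs["modifier"] = character["modifier"]
            ET.SubElement(structure_el, "character", char_attrs)
    for relation in row.get("relations", []):
        ET.SubElement(
            root,
            "relation",
            {"name": relation["name"], "from": relation["from"], "to": relation["to"]},
        )
    return root


if __name__ == "__main__":
    print(ET.tostring(build_description_xml(SAMPLE_ROW), encoding="unicode"))
```

Keeping characters nested under their structure, and relations as separate elements that refer to structure ids, is one simple way to represent the three kinds of object the abstract mentions; the real schema may organise them differently.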

Highlights

The transformation of texts from taxonomic literature into structured data remains a key challenge in biodiversity informatics, recognised by international initiatives such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), and the Biodiversity Heritage Library (BHL) (Hobern et al. 2013, Thessen and Parr 2014, Salle et al. 2009).
Taxonomic work, expressed in a simplified way, consists of organising all forms of life, ideally in a hierarchy, and assigning to each taxon a Latin name, a taxonomic category that associates it with a level in the hierarchy, a morphological description, a diagnostic description that is sometimes accompanied by diagnostic drawings, a habitat description, information about its distribution, and identification keys, amongst other information.
The semantic annotation results showed that, because morphological descriptions of plants are semi-structured, it is feasible to implement, with excellent results, a simple rule-based semantic analysis algorithm using available technology (i.e. FreeLing, Ontology Term Organizer (OTO), Plant Ontology (PO), and Flora Mesoamericana English-Spanish Glossary).
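
To make the idea of rule-based annotation over semi-structured descriptions more concrete, here is a deliberately tiny sketch. It is not the authors' algorithm and does not use FreeLing, OTO or the Plant Ontology; the five-entry glossary, the sample sentence and the single rule (a character term attaches to the most recent structure term in the same clause) are assumptions invented for this example.

```python
import re

# Hypothetical mini-glossary: Spanish term -> (kind, English label).
# A real system would draw on resources such as a bilingual glossary.
GLOSSARY = {
    "hojas": ("structure", "leaf"),
    "flores": ("structure", "flower"),
    "alternas": ("character", "arrangement"),
    "elípticas": ("character", "shape"),
    "blancas": ("character", "coloration"),
}


def annotate(description):
    """Return (structure term, character name, value) triples.

    Toy rule: within each ';'-separated clause, every known character
    term is attached to the most recently seen structure term.
    """
    triples = []
    for clause in description.lower().split(";"):
        current_structure = None
        for token in re.findall(r"\w+", clause):
            entry = GLOSSARY.get(token)
            if entry is None:
                continue  # unknown words are simply skipped in this sketch
            kind, label = entry
            if kind == "structure":
                current_structure = token
            elif current_structure is not None:
                triples.append((current_structure, label, token))
    return triples


if __name__ == "__main__":
    for triple in annotate("Hojas alternas, elípticas; flores blancas."):
        print(triple)
```

The rule works here only because the sample sentence follows the regular "structure, then its characters" pattern; free-running prose would need real linguistic analysis.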

Summary

Introduction

The transformation of texts from taxonomic literature into structured data remains a key challenge in biodiversity informatics, recognised by international initiatives such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), and the Biodiversity Heritage Library (BHL) (Hobern et al. 2013, Thessen and Parr 2014, Salle et al. 2009). The BHL estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles) (Rinaldo et al. 2009). It is necessary to develop data standards and software tools to extract, integrate and publish this knowledge into existing free and open access repositories to support science, education and biodiversity conservation. Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Taxonomic work, expressed in a simplified way, consists of organising all forms of life, ideally in a hierarchy, and assigning to each taxon a Latin name, a taxonomic category that associates it with a level in the hierarchy, a morphological description, a diagnostic description that is sometimes accompanied by diagnostic drawings, a habitat description, information about its distribution, and identification keys, amongst other information.

Journal: Biodiversity Data Journal
Publication Date: Jun 26, 2018
Citations: 5
License type: CC BY 4.0
