Abstract

Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available only as text in scientific publications. The volume of publications is very large, so processing them to obtain highly structured text would be complex and very expensive. Approaches like citizen science may help by selecting whole fragments of text that deal with morphological descriptions, but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles). It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation. This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3) and Trees of Costa Rica Volume IV (TCRv4), and to a subset of descriptions from the Manual of Plants of Costa Rica (MPCR), with very competitive results (average performance above 92.5%). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters, and relations between characters and structures. Each extracted object is associated with attributes such as name, value, modifiers, restrictions and ontology term id, amongst others. The implemented tool is free software. It was developed in Java and integrates existing technology such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.
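
To make the output format described above more concrete, the following is a minimal sketch in Java of how one extracted structure and its characters might be serialised to XML. It is an illustration only, not the tool's actual code or schema: the class names (`AnnotationSketch`, `Structure`, `PlantCharacter`), the element and attribute names (`structure`, `character`, `ontology_id`, `modifier`), the sample Spanish terms, and the placeholder ontology id `PO:XXXXXXX` are all assumptions introduced here.

```java
import java.util.Arrays;
import java.util.List;

public class AnnotationSketch {

    /** Hypothetical holder for one character (a property of a structure). */
    static final class PlantCharacter {
        final String name;
        final String value;
        final String modifier; // may be null when the description has no modifier

        PlantCharacter(String name, String value, String modifier) {
            this.name = name;
            this.value = value;
            this.modifier = modifier;
        }
    }

    /** Hypothetical holder for one structure and the characters attached to it. */
    static final class Structure {
        final String name;
        final String ontologyId;
        final List<PlantCharacter> characters;

        Structure(String name, String ontologyId, List<PlantCharacter> characters) {
            this.name = name;
            this.ontologyId = ontologyId;
            this.characters = characters;
        }
    }

    /** Escapes the characters that are not allowed inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serialises one structure, with its characters as child elements. */
    static String toXml(Structure s) {
        StringBuilder sb = new StringBuilder();
        sb.append("<structure name=\"").append(escape(s.name))
          .append("\" ontology_id=\"").append(escape(s.ontologyId)).append("\">\n");
        for (PlantCharacter c : s.characters) {
            sb.append("  <character name=\"").append(escape(c.name))
              .append("\" value=\"").append(escape(c.value)).append("\"");
            if (c.modifier != null) {
                sb.append(" modifier=\"").append(escape(c.modifier)).append("\"");
            }
            sb.append("/>\n");
        }
        sb.append("</structure>\n");
        return sb.toString();
    }

    /** Invented sample data; the ontology id is a placeholder, not a real PO term. */
    static Structure example() {
        return new Structure("hoja", "PO:XXXXXXX", Arrays.asList(
                new PlantCharacter("arrangement", "alterna", null),
                new PlantCharacter("length", "5-10 cm", "usualmente")));
    }

    public static void main(String[] args) {
        System.out.print(toXml(example()));
    }
}
```

Running `main` prints one `structure` element containing two `character` elements. Relations between structures, restrictions, and the remaining attributes mentioned in the abstract are left out to keep the sketch short.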

Highlights

  • The transformation of texts from taxonomic literature into structured data remains a key challenge in biodiversity informatics, recognised by international initiatives such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), and the Biodiversity Heritage Library (BHL) (Hobern et al 2013, Thessen and Parr 2014, Salle et al 2009)

  • The taxonomic work, expressed in a simplified way, consists of organising all forms of life ideally in a hierarchy, assigning a Latin name to each taxon, a taxonomic category that associates it to a level in the hierarchy, a morphological description, a diagnostic description that is sometimes accompanied by diagnostic drawings, habitat description, information about its distribution, and identification keys, amongst other information

  • The semantic annotation results showed that, due to the semi-structured nature of morphological descriptions of plants, it is feasible to implement, with excellent results, a simple semantic analysis algorithm based on rules using available technology (i.e. FreeLing, Ontology Term Organizer (OTO), Plant Ontology (PO), and Flora Mesoamericana English-Spanish Glossary)

Introduction

The transformation of texts from taxonomic literature into structured data remains a key challenge in biodiversity informatics, recognised by international initiatives such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), and the Biodiversity Heritage Library (BHL) (Hobern et al 2013, Thessen and Parr 2014, Salle et al 2009). The BHL estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles) (Rinaldo et al 2009). It is necessary to develop data standards and software tools to extract, integrate and publish this knowledge into existing free and open access repositories to support science, education and biodiversity conservation. Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Taxonomic work, put simply, consists of organising all forms of life, ideally in a hierarchy, and giving each taxon a Latin name, a taxonomic category that places it at a level in the hierarchy, a morphological description, a diagnostic description that is sometimes accompanied by diagnostic drawings, a habitat description, information about its distribution, and identification keys, amongst other information.
