COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature.

Nhung Nguyen, Sophia Ananiadou, Roselyn Gabud

doi:10.3897/bdj.7.e29626

Abstract

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is an F-score of approximately 81.86%. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.

Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.
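The inter-annotator agreement reported in the abstract is expressed as an F-score, which can be computed by treating one annotator's entities as the reference and the other's as the response. The following is a minimal sketch of that calculation, not the authors' evaluation script. It assumes exact matching on document, character span and category; the abstract does not state which matching criterion was used. The `Entity` type, the `agreement_f_score` function and the sample annotations are hypothetical names and data introduced only for illustration.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated entity mention (hypothetical representation)."""
    doc_id: str
    start: int      # character offset where the mention begins
    end: int        # character offset where the mention ends
    category: str   # e.g. "Taxon", "Habitat"


def agreement_f_score(annotator_a: set[Entity], annotator_b: set[Entity]) -> float:
    """Exact-match F-score, with annotator A as reference and B as response.

    Assumption: two mentions agree only if document, span and category
    are all identical.
    """
    if not annotator_a or not annotator_b:
        return 0.0
    matched = len(annotator_a & annotator_b)
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations only; these are not taken from the corpus.
annotator_a = {
    Entity("doc1", 0, 14, "Taxon"),
    Entity("doc1", 20, 26, "GeographicalLocation"),
    Entity("doc1", 30, 42, "Habitat"),
}
annotator_b = {
    Entity("doc1", 0, 14, "Taxon"),
    Entity("doc1", 20, 26, "GeographicalLocation"),
    Entity("doc1", 30, 38, "Habitat"),             # span boundary differs
    Entity("doc1", 50, 54, "TemporalExpression"),  # only annotator B marked this
}

score = agreement_f_score(annotator_a, annotator_b)
print(f"Agreement F-score: {score:.4f}")
```

Because the F-score is the harmonic mean of precision and recall, swapping the two annotators exchanges precision and recall but leaves the score unchanged, which is why it works as a symmetric agreement measure. In this toy example the habitat mention counts as a disagreement purely because of a boundary difference, which hints at how the category with the least clear-cut boundaries can end up with the lowest agreement.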

Highlights

Biodiversity plays a central role in our daily lives, given its implications on ecological resilience, food security, species and subspecies endangerment and natural sustainability.
The basis for the gold standard corpus was a set of English documents downloaded from the Biodiversity Heritage Library (BHL).
We have described the process of constructing the Conserving Philippine Biodiversity by Understanding Big Data (COPIOUS) corpus, which is annotated with five entity categories relevant to the study of biodiversity: taxon names, geographical locations, habitats, temporal expressions and persons.

Summary

Introduction

Biodiversity plays a central role in our daily lives, given its implications on ecological resilience, food security, species and subspecies endangerment and natural sustainability. Research in this domain has recently seen accelerated growth, leading to the "big data" scenario of the biodiversity literature. Text mining has successfully been applied to the biomedical literature (Arighi et al. 2013, Wei et al. 2013, Mihăilă et al. 2015, Ananiadou and Thompson 2017) and, more recently, it has been employed in the biodiversity domain to unlock knowledge hidden in the literature. To address the lack of a consolidated corpus for this purpose, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Journal: Biodiversity Data Journal	Publication Date: Jan 22, 2019
Citations: 20	License type: CC BY 4.0
