Gold Standard Corpus Research Articles

Understanding the biology underpinning the natural regeneration of plant species in order to plan effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exist widely used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive conditions and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which combines classical rule-based pattern-matching methods with transformer-based language models by framing our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprising a named entity recognition (NER) tool and our hybrid RE tool.
The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract two kinds of entity pairs: a reproductive condition with its temporal expression, and a habitat with its geographical location. We test our IE pipeline on the forestry compendium available in the CABI (Centre for Agriculture and Bioscience International) Digital Library, and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.
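
The dictionary-based step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the term list, the function names (build_pattern, find_mentions) and the longest-match-first rule are assumptions made for illustration, and only the phrases "fruited heavily" and "mass flowering" come from the abstract itself.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical mini-dictionary. Only the first two phrases appear in the
# abstract; the remaining entries are illustrative assumptions.
REPRODUCTIVE_TERMS = ["fruited heavily", "mass flowering", "flowering", "fruiting"]


def build_pattern(terms: List[str]) -> Pattern[str]:
    """Compile one case-insensitive regex that matches any dictionary term."""
    # Sort longest first so that "mass flowering" is preferred over "flowering"
    # when both could match at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_mentions(text: str, pattern: Pattern[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, surface form) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "The species fruited heavily in 1998 after a mass flowering event."
    for start, end, surface in find_mentions(sentence, build_pattern(REPRODUCTIVE_TERMS)):
        print(start, end, surface)
```

The character offsets returned by find_mentions could then be combined with the temporal expressions found by an NER tool to form candidate pairs for relation extraction. How the pairs are actually scored in the described pipeline is not specified here.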

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e., named entity recognition) and normalization (i.e., entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, manually indexed with chemical substances by experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e., appear in the listing of MeSH terms for the document.
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.

Articles published on Gold Standard Corpus

Extracting social support and social isolation information from clinical psychiatry notes: comparing a rule-based natural language processing system and a large language model.

Natural Language Processing Accurately Differentiates Cancer Symptom Information in Electronic Health Record Narratives.

A tree-based corpus annotated with Cyber-Syndrome, symptoms, and acupoints

Term-BLAST-like alignment tool for concept recognition in noisy clinical texts.

Automated classification of lay health articles using natural language processing: a case study on pregnancy health and postpartum depression.

Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences

Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline

Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations

An analysis of entity normalization evaluation biases in specialized domains

A study on methods for revising dependency treebanks: in search of gold

An expectation-maximization framework for comprehensive prediction of isoform-specific functions.

Inter-Annotator Agreement for the Factual Status of Predicates in the TAGFACT Corpus

NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles.

PFSA-ID: an annotated Indonesian corpus and baseline model of public figures statements attributions

Event-Based Clinical Finding Extraction from Radiology Reports with Pre-trained Language Model.

BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain

A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature

Development and validation of an automated basal cell carcinoma histopathology information extraction system using natural language processing.

Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision

PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature
