A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Karin Verspoor, Colin Warner, Nianwen Xue, Yuriy Malenkiy, Helen L Johnson, Michael Bada, Jinho D Choi, Christophe Roeder, William A Baumgartner, Lawrence E Hunter, Kevin Bretonnel Cohen, Arrick Lanfranchi, Martha Palmer, Miriam Eckert, Christopher Funk

doi:10.1186/1471-2105-13-207

Open Access

https://doi.org/10.1186/1471-2105-13-207

Abstract

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

Highlights

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus
The markup process of the CRAFT corpus consisted of phases of automatic parsing and manual annotation and correction of all 97 articles in the corpus
For the dependency parser output, we report the individual score on each training fold, the average across the training folds, the score on the development set data for a model trained on the complete CRAFT training set, and the score on the development set data for the standard model for each parser trained on the Penn Treebank Wall Street Journal corpus
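
The four quantities listed for the dependency parsers can be laid out as a small report structure. The following is a minimal sketch, not the authors' evaluation code: the function name `summarize_parser_scores`, the dictionary keys, and all numbers are hypothetical placeholders chosen for illustration, and the scoring metric itself is left unspecified.

```python
from statistics import mean


def summarize_parser_scores(fold_scores, craft_dev_score, wsj_dev_score):
    """Collect the four kinds of score reported for one dependency parser.

    fold_scores      -- one score per training fold
    craft_dev_score  -- development-set score, model trained on all CRAFT training data
    wsj_dev_score    -- development-set score, standard model trained on the
                        Penn Treebank Wall Street Journal corpus
    """
    return {
        "per_fold": list(fold_scores),
        "fold_average": mean(fold_scores),
        "craft_model_on_dev": craft_dev_score,
        "wsj_model_on_dev": wsj_dev_score,
    }


# Placeholder values only; these are NOT results from the paper.
report = summarize_parser_scores([0.80, 0.82, 0.84], 0.85, 0.70)
for name, value in report.items():
    print(name, value)
```

Keeping the fold-level scores alongside their average makes it possible to see how stable a parser is across folds before comparing the CRAFT-trained model with the WSJ-trained one on the same development data.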

Summary

Introduction

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. Text mining of the biomedical literature has gained increasing attention in recent years, as biologists face a body of literature that is too large and grows too rapidly to be reviewed by individual researchers [1]. The majority of research in biomedical natural language processing has focused on the abstracts of journal articles. Cohen et al. [4] compared abstracts and article bodies and found that they differed in a number of respects with implications for natural language processing. They noted that these differences sometimes demonstrably affected tool performance. Gene mention systems trained on abstracts suffered severe performance degradations when applied to full text.

Journal: BMC Bioinformatics	Publication Date: Aug 17, 2012
Citations: 151	License type: CC BY 2.0

Similar Papers

Improving the robustness and accuracy of biomedical language models through adversarial training
Milad Moradi ... Matthias Samwald
Journal of Biomedical Informatics | VOL. 132
15 Jun 2022

Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey.
Anastassia Shaitarova ... Michael Krauthammer
Yearbook of Medical Informatics | VOL. 32
01 Aug 2023

Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT
Usman Naseem ... Adam G Dunn
BMC Bioinformatics | VOL. 23
21 Apr 2022

Negation-based transfer learning for improving biomedical Named Entity Recognition and Relation Extraction
Hermenegildo Fabregat ... Lourdes Araujo
Journal of Biomedical Informatics | VOL. 138
04 Jan 2023
