Abstract

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

Highlights

  • We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus

  • The markup process of the CRAFT corpus consisted of phases of automatic parsing and manual annotation and correction of all 97 articles in the corpus

  • For the dependency parser output, we report the individual score on each training fold, the average across the training folds, the score on the development set data for a model trained on the complete CRAFT training set, and the score on the development set data for the standard model for each parser trained on the Penn Treebank Wall Street Journal corpus


Summary

Introduction

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. Text mining of the biomedical literature has gained increasing attention in recent years, as biologists face a body of literature that is too large and grows too rapidly to be reviewed by individual researchers [1]. The majority of research in biomedical natural language processing has focused on the abstracts of journal articles. Cohen et al. [4] compared abstracts and article bodies and found that they differed in a number of respects with implications for natural language processing. They noted that these differences sometimes demonstrably affected tool performance: for example, gene mention systems trained on abstracts suffered severe performance degradations when applied to full text.

