Identifying and classifying goals for scientific knowledge.

Mayla R Boguslav, Lawrence E Hunter, Elizabeth K White, Nourah M Salem, Sonia M Leach

doi:10.1093/bioadv/vbab012

Abstract

Motivation: Science progresses by posing good questions, yet biomedical text mining has paid little attention to them. We propose a novel task for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, that is, statements where scientific knowledge is missing or incomplete. Such technology could have many significant impacts, from the training of PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as the first step towards these goals.

Results: We present a novel ignorance taxonomy driven by the role statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10 000 annotations, and used it to train classifiers that achieved F1 scores above 0.80.

Availability and implementation: The corpus and source code are freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.

Full Text