Abstract
Background: Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present a series of sequence labeling systems that we used and adapted in our experiments for solving this task. Our experiments show how to optimize the hyperparameters of the classifiers involved. To this end, we utilize various algorithms for hyperparameter optimization. Finally, we present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier.
Results: We analyze the impact of hyperparameter optimization on named entity recognition in biomedical research and show that this optimization results in a performance increase of up to 60%. In our evaluation, our ensemble classifier based on multiple sequence labelers, called CRFVoter, outperforms each individual extractor. For the blinded test set provided by the BioCreative organizers, CRFVoter achieves an F-score of 75%, a recall of 71% and a precision of 80%. For the GPRO type 1 evaluation, CRFVoter achieves an F-score of 73%, a recall of 70% and the best precision (77%) among all task participants.
Conclusion: CRFVoter is effective when multiple sequence labeling systems are to be used and performs better than the individual systems it integrates.
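As a quick consistency check on the figures reported above, the following sketch recomputes the F-score from the stated precision and recall. It assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall); the abstract does not say so explicitly, and the function name `f_score` is ours.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's F-score is the balanced F1.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract, written as fractions.
overall = f_score(0.80, 0.71)  # blinded test set
type1 = f_score(0.77, 0.70)    # GPRO type 1 evaluation

print(f"overall: {overall:.2f}, GPRO type 1: {type1:.2f}")
```

Under that assumption, both values round to the percentages given in the abstract (75% and 73%).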
Highlights
The research fields of biology, chemistry and biomedicine have attracted increasing interest due to their social and scientific importance and because of the challenges arising from the intrinsic complexity of these domains
Our experiments show that a simple majority vote brings no gain in precision and recall compared to the best performing reference systems examined in our study
We present a survey of Named Entity Recognition (NER) systems trained for the Gene and Protein Related Object Recognition (GPRO) tasks, together with their parameter settings optimized by means of the Tree-structured Parzen Estimator (TPE) hyperparameter optimization algorithm
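To make the majority-vote baseline mentioned in the highlights concrete, here is a minimal sketch of per-token voting over the outputs of several sequence labelers. The BIO-style label names (`B-GPRO`, `I-GPRO`, `O`), the three toy taggers, and the tie-breaking rule are all assumptions for illustration; this is not the authors' implementation. CRFVoter, by contrast, is described as a two-stage CRF and is not reproduced here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several taggers by plurality.

    predictions: list of label sequences, one per tagger, all of equal length.
    Ties are broken in favor of the tagger listed first (an arbitrary choice).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Pick the first label, in tagger order, that reaches the top count.
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Hypothetical outputs of three taggers for the same four tokens.
tagger_a = ["B-GPRO", "I-GPRO", "O", "O"]
tagger_b = ["B-GPRO", "O", "O", "O"]
tagger_c = ["O", "I-GPRO", "O", "B-GPRO"]

voted = majority_vote([tagger_a, tagger_b, tagger_c])
print(voted)
```

A vote of this kind treats every tagger as equally reliable for every token, which is one plausible reason it adds nothing over the best single system.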
Summary
The research fields of biology, chemistry and biomedicine have attracted increasing interest due to their social and scientific importance and because of the challenges arising from the intrinsic complexity of these domains. The growth in the amount of publicly available data has led to enormous efforts to develop, analyze and apply new learning methods in chemistry and biology. This concerns, for example, virtual screening [11] for drug design and drug discovery [12, 13]. Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. We describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier.
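The transformation into a sequence labeling problem can be sketched as follows: entity annotations given as character offsets are turned into one label per token. The BIO scheme, the whitespace tokenizer, the `GPRO` label name, and the example sentence are all assumptions chosen for illustration; the summary does not specify the encoding or tokenization actually used.

```python
import re


def to_bio(text, spans, label="GPRO"):
    """Turn character-offset entity spans into per-token BIO labels.

    spans: list of (start, end) character offsets, end exclusive.
    Tokenization is naive (runs of non-whitespace), for illustration only.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        tag = "O"
        for start, end in spans:
            if match.start() >= start and match.end() <= end:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        labels.append(tag)
    return tokens, labels


# Made-up example sentence with one annotated mention at offsets 13..30.
text = "Mutations in tumor protein p53 are common"
spans = [(13, 30)]

tokens, labels = to_bio(text, spans)
for token, tag in zip(tokens, labels):
    print(f"{token}\t{tag}")
```

Once the data is in this form, any token-level sequence labeler can be trained on it, and the outputs of several such labelers can be combined by an ensemble.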