Abstract

Background

Named entity (NE) extraction is one of the most fundamental and important tasks in biomedical information extraction. It involves identifying certain entities in text and classifying them into predefined categories. The biomedical community has not yet reached a general consensus on NE annotation, so corpus incompatibilities make it very difficult to compare existing systems. For the same reason, we cannot exploit the advantages of using different corpora together. In the present work we address these issues of corpus compatibility and use a single objective optimization (SOO) based classifier ensemble technique that exploits the search capability of a genetic algorithm (GA) for NE extraction in biomedicine. We hypothesize that the reliability of each classifier's predictions differs among the output classes. We use Conditional Random Field (CRF) and Support Vector Machine (SVM) frameworks to build a number of models that depend on different representations of the feature set and/or feature templates. We extracted the features without using any deep domain knowledge and/or resources.

Results

To assess the challenges of corpus compatibility, we experiment with different benchmark datasets and their various combinations. Comparisons with existing approaches show the efficacy of the technique. The GA based ensemble achieves a performance improvement of around 2% over the individual classifiers. The degradation in performance on the integrated corpus clearly shows the difficulty of the task.

Conclusions

In summary, our ensemble based approach attains state-of-the-art performance levels for entity extraction on three different kinds of biomedical datasets. The possible reasons for its better performance are (i) the use of a rich variety of features, as described in Subsection “Features for named entity extraction”; and (ii) the use of a GA based classifier ensemble technique to combine the outputs of multiple classifiers.
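
The idea that each classifier is reliable for only some output classes can be illustrated with a small sketch. This is not the authors' implementation: the binary "may classifier m vote for class c" encoding, token accuracy as the objective, truncation selection, uniform crossover, the all-ones seed chromosome, and the toy votes are all assumptions made for illustration. The abstract only states that a GA searches for a single-objective ensemble.

```python
import random

def ensemble_predict(chromosome, votes, n_classes):
    """Combine one token's votes; chromosome[m][c] == 1 lets classifier m vote for class c."""
    scores = [0] * n_classes
    for m, c in enumerate(votes):
        scores[c] += chromosome[m][c]
    # Ties (including "nobody may vote") fall back to the lowest class index.
    return max(range(n_classes), key=scores.__getitem__)

def fitness(chromosome, all_votes, gold, n_classes):
    """Token accuracy on development data (assumed objective, not taken from the paper)."""
    hits = sum(ensemble_predict(chromosome, v, n_classes) == g
               for v, g in zip(all_votes, gold))
    return hits / len(gold)

def evolve(all_votes, gold, n_models, n_classes,
           pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)

    def score(ch):
        return fitness(ch, all_votes, gold, n_classes)

    # Seed with the all-ones chromosome (plain majority voting) plus random ones.
    population = [[[1] * n_classes for _ in range(n_models)]]
    population += [[[rng.randint(0, 1) for _ in range(n_classes)]
                    for _ in range(n_models)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection; the best always survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover per gene, followed by bit-flip mutation.
            child = [[(x if rng.random() < 0.5 else y) ^ int(rng.random() < p_mut)
                      for x, y in zip(row_a, row_b)]
                     for row_a, row_b in zip(a, b)]
            children.append(child)
        population = parents + children
    return max(population, key=score)

# Toy development set: gold class per token and the votes of three hypothetical classifiers.
gold = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]
votes = [(0, 1, 1), (0, 1, 0), (0, 0, 1), (1, 1, 2), (2, 1, 1),
         (0, 1, 0), (2, 2, 2), (1, 2, 2), (2, 0, 2), (0, 1, 1)]

baseline = [[1] * 3 for _ in range(3)]  # every classifier votes for every class
best = evolve(votes, gold, n_models=3, n_classes=3)
print(fitness(baseline, votes, gold, 3), fitness(best, votes, gold, 3))
```

Because the majority-voting chromosome is in the initial population and the top half of each generation is carried over unchanged, the selected ensemble can never score below plain majority voting on the development data. Whether it generalizes to unseen text is a separate question that this toy example does not address.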

Highlights

  • Named Entity (NE) extraction is one of the most fundamental and important tasks in biomedical information extraction

  • Datasets and experimental setup: we evaluate our approach on three benchmark datasets, namely the JNLPBA 2004 shared task, AIMed and GENETAG

  • For the GENIA corpus, the best individual classifier achieves recall, precision and F-measure values of 73.10%, 76.78% and 74.90%, respectively. This corresponds to a Conditional Random Field (CRF) based classifier with the following feature template: the contexts of previous and two tokens and all their possible n-gram (n ≤ 2) combinations from left to right, prefixes and suffixes of up to 3 characters of the current word only, and a feature vector consisting of length, infrequent word, normalization, chunk, orthographic constructs. Table 1: Overall evaluation results on the original corpus (Saha et al 2013)
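
As a quick consistency check on the figures above, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the standard balanced F-measure (harmonic mean with beta = 1); the function name and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (balanced F1 when beta == 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values reported for the best individual classifier on the GENIA corpus.
f1 = f_measure(precision=76.78, recall=73.10)
print(f"{f1:.2f}")  # prints 74.89
```

The recomputed value is within 0.01 of the reported 74.90%. A gap of that size is what one would expect if the published precision and recall were themselves rounded before the F-measure was reported.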


Summary

Introduction

Named Entity (NE) extraction is one of the most fundamental and important tasks in biomedical information extraction. It involves two stages: identification of certain kinds of entities and their classification into predefined categories. There is as yet no general consensus on NE annotation, so it is very difficult to compare existing systems because of corpus incompatibilities. For the same reason we cannot exploit the advantages of using different corpora together: it is not possible to use all the available corpora to build a single supervised NE extraction system. This reduces to two problems: (i) it is hard to compare systems that were built using different corpora, and (ii) there is hardly any existing state-of-the-art NE extraction system that performs well across many domains.

