A Multiengine NER System with Context Pattern Learning and Post-processing Improves System Performance

Asif Ekbal, Sivaji Bandyopadhyay

doi:10.1142/s1793840609002068

Abstract

This paper reports on the development of a Named Entity Recognition (NER) system for Bengali that combines the outputs of three classifiers: Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). Part of a Bengali news corpus, developed from the web archive of a leading Bengali newspaper, has been manually annotated with four major named entity (NE) tags: Person name, Location name, Organization name and Miscellaneous name. We have also used the annotated corpus of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL). A tag conversion routine has been developed to convert the fine-grained NE tags of the NERSSEAL corpus to the coarse-grained tagset of four tags. The system uses contextual information about the words along with a variety of orthographic word-level features that help predict the four NE classes. We have considered language-independent features as well as language-dependent features extracted from various language-specific resources. Lexical context patterns, generated from an unlabeled corpus of 10 million wordforms using an active learning technique, have been used both to develop a baseline NER system and as features of the classifiers to improve their performance. A number of post-processing techniques have also been applied to improve the performance of the classifiers. Finally, the classifiers are combined into a multiengine NER system using three weighted voting techniques. The system has been trained on a dataset of 272K wordforms and tested on a dataset of 35K wordforms. Experimental results show the effectiveness of the proposed approach, with overall average Recall, Precision and F-Score values of 93.81%, 92.18% and 92.98%, respectively. The proposed system also outperforms three other existing Bengali NER systems.
The language-independent versions of the ME, CRF and SVM based NER systems have been evaluated for four other popular Indian languages, namely Hindi, Telugu, Oriya and Urdu, with datasets obtained from the NERSSEAL shared task data. The SVM based system yielded the best performance, with F-Score values of 76.35%, 72.65%, 69.34% and 65.66% for Hindi, Telugu, Oriya and Urdu, respectively.
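
The abstract does not say how the three weighted voting techniques are defined. As an illustration only, the following is a minimal sketch of one common form of weighted voting over per-token tag predictions. The function name `weighted_vote`, the tag labels, and the weight values are hypothetical and are not taken from the paper; in practice the weights might come from each classifier's accuracy on held-out data.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-token tag sequences from several classifiers.

    predictions: dict mapping classifier name -> list of tags, one per token
    weights:     dict mapping classifier name -> non-negative weight
    (Both structures are assumptions made for this sketch.)
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all classifiers must tag the same number of tokens")
    n_tokens = lengths.pop()

    combined = []
    for i in range(n_tokens):
        # Sum the weights of the classifiers that voted for each tag.
        scores = defaultdict(float)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]
        # The tag with the highest total weight wins; ties are broken
        # alphabetically so the result is deterministic.
        best = max(sorted(scores), key=lambda tag: scores[tag])
        combined.append(best)
    return combined


if __name__ == "__main__":
    # Hypothetical outputs for a three-token sentence.
    predictions = {
        "ME": ["PER", "O", "LOC"],
        "CRF": ["PER", "O", "ORG"],
        "SVM": ["O", "O", "ORG"],
    }
    # Placeholder weights, not values reported in the paper.
    weights = {"ME": 0.3, "CRF": 0.3, "SVM": 0.4}
    print(weighted_vote(predictions, weights))
```

With these placeholder weights, two agreeing classifiers outvote the third. Giving one classifier a weight larger than the other two combined would let it decide every token by itself, which is why the choice of weighting scheme matters.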

Journal: International Journal of Computer Processing of Languages
Publication Date: Jun 1, 2009
Citations: 2

Similar Papers

Named Entity Recognition in Bengali
Asif Ekbal ... Sivaji Bandyopadhyay
Northern European Journal of Language Technology | VOL. 1
02 Feb 2010

Named Entity Recognition using Support Vector Machine: A Language Independent Approach
...
Zenodo (CERN European Organization for Nuclear Research) | VOL. -
23 Mar 2010

Named Entity Recognition in Indian Languages Using Maximum Entropy Approach
Asif Ekbal ... Sivaji Bandyopadhyay
International Journal of Computer Processing of Languages | VOL. 21
01 Sep 2008

Named entity recognition in Bengali and Hindi using support vector machine
Asif Ekbal ... Sivaji Bandyopadhyay
Lingvisticæ Investigationes | VOL. 34
07 Jul 2011
