Abstract

This paper reports on the development of a Named Entity Recognition (NER) system for Bengali that combines the outputs of three classifiers: Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). A portion of a Bengali news corpus, developed from the web archive of a leading Bengali newspaper, has been manually annotated with four major named entity (NE) tags: Person name, Location name, Organization name and Miscellaneous name. We have also used the annotated corpus of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL). A tag conversion routine has been developed to convert the fine-grained NE tags of the NERSSEAL corpus to the coarse-grained tagset of four tags. The system uses contextual information about the words along with a variety of orthographic word-level features that are helpful in predicting the four NE classes. We have considered language-independent features as well as language-dependent features extracted from various language-specific resources. Lexical context patterns, generated from an unlabeled corpus of 10 million wordforms using an active learning technique, have been used both to develop a baseline NER system and as features of the classifiers to improve their performance. A number of post-processing techniques have also been applied to improve the performance of the classifiers. Finally, the classifiers are combined into a multi-engine NER system using three weighted voting techniques. The system has been trained on a dataset of 272K wordforms and tested on a dataset of 35K wordforms. Experimental results show the effectiveness of the proposed approach, with overall average Recall, Precision and F-Score values of 93.81%, 92.18% and 92.98%, respectively. The proposed system also outperforms three other existing Bengali NER systems.
The language-independent versions of the ME, CRF and SVM based NER systems have also been evaluated on four other popular Indian languages, namely Hindi, Telugu, Oriya and Urdu, using datasets from the NERSSEAL shared task. The SVM based system yielded the best performance, with F-Score values of 76.35%, 72.65%, 69.34% and 65.66% for Hindi, Telugu, Oriya and Urdu, respectively.
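
The abstract does not say which three weighted voting techniques are used, so the following is only a minimal sketch of one generic scheme: each classifier casts a vote for its predicted tag on every token, each vote counts with a fixed per-classifier weight, and the tag with the largest total wins. The function name `weighted_vote`, the tag labels (`PER`, `LOC`, `ORG`, `MISC`, `O`) and the weight values are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-token tag sequences from several classifiers.

    predictions: dict mapping classifier name -> list of tags (one per token)
    weights:     dict mapping classifier name -> non-negative vote weight
    Returns the list of tags with the highest summed weight per token.
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all classifiers must tag the same number of tokens")
    n_tokens = lengths.pop()

    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]
        # Pick the tag with the largest total weight for this token.
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical outputs of the three classifiers on a four-token sentence.
predictions = {
    "ME":  ["PER", "O", "LOC", "O"],
    "CRF": ["PER", "O", "ORG", "O"],
    "SVM": ["O",   "O", "ORG", "MISC"],
}
# Hypothetical weights; in practice they might be derived from each
# classifier's score on a development set.
weights = {"ME": 0.30, "CRF": 0.33, "SVM": 0.37}

print(weighted_vote(predictions, weights))
```

With these illustrative weights, no single classifier can outvote the other two when they agree, so the combination behaves like a majority vote that falls back on the weights only when all three disagree.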
