Named Entity Recognition in Indian Languages Using Maximum Entropy Approach

Asif Ekbal, Sivaji Bandyopadhyay

doi:10.1142/s1793840608001913

Abstract

This paper reports the development of a Named Entity Recognition (NER) system for Indian languages, specifically Bengali, Hindi, Telugu, Oriya and Urdu, using the statistical Maximum Entropy (ME) framework. We have used the annotated corpora obtained from the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL), which are tagged with twelve NE tags. A tag conversion routine has been developed to convert these corpora to a form tagged with four NE tags, namely Person name, Location name, Organization name and Miscellaneous name. The system makes use of contextual information about the words along with a variety of orthographic word-level features that are helpful in predicting the four NE classes. In this work, we have considered language-independent as well as language-specific features. Language-independent features include the contextual words, prefixes and suffixes of all the words in the training corpus, several digit features depending on the presence and/or number of digits in a token, whether the token is the first word of the sentence, and the frequency features of the words. The system also considers linguistic (language-specific) features for Bengali and Hindi. Linguistic features for Bengali include the set of known suffixes that may appear with NEs, clue words that help in predicting location and organization names, words that help to recognize measurement expressions, designation words that help to identify person names, and various gazetteer lists such as first names, middle names, last names, location names, organization names, function words, month names, weekdays, etc. As linguistic features for Hindi, the system uses only the lists of first names, middle names, last names, function words, month names and weekdays, along with the list of words that help to recognize measurements.
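The language-independent features listed above can be sketched as a per-token feature dictionary. This is only an illustrative sketch, not the authors' implementation: the function name `token_features`, the feature names, the context window of two words and the maximum affix length of three characters are all assumptions, since the abstract does not state them. Frequency features are left out because they need corpus-wide counts.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int,
                   window: int = 2, max_affix: int = 3) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i].

    window and max_affix are assumed values chosen for illustration.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "word": word,
        "first_word": i == 0,  # is this the first word of the sentence?
    }

    # Contextual words: neighbours within the window, padded at the edges.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    # Prefixes and suffixes of up to max_affix characters.
    for n in range(1, max_affix + 1):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]

    # Digit features: presence and number of digits in the token.
    # str.isdigit also accepts Unicode decimal digits such as Bengali numerals.
    digits = sum(ch.isdigit() for ch in word)
    feats["has_digit"] = digits > 0
    feats["all_digits"] = digits > 0 and digits == len(word)
    feats["digit_count"] = digits
    return feats


sentence = ["Sachin", "scored", "100", "runs"]
first = token_features(sentence, 0)
number = token_features(sentence, 2)
```

Dictionaries of this shape can be passed to most maximum entropy (multinomial logistic regression) toolkits after vectorization; which toolkit the authors used is not stated in the abstract.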
In addition to the other features, part-of-speech (POS) information of the words has also been considered for Bengali and Hindi. No linguistic features have been considered for Telugu, Oriya and Urdu. The evaluation results show that the use of linguistic features improves the performance of the system. The system has been trained with 122,467 Bengali, 502,974 Hindi, 64,026 Telugu, 93,173 Oriya and 35,447 Urdu tokens. In 10-fold cross-validation, the system achieved its highest overall average Recall, Precision and F-Score values of 88.01%, 82.63% and 85.22%, respectively, for Bengali. The 10-fold cross-validation tests on the Hindi, Telugu, Oriya and Urdu data yielded overall average F-Score values of 82.66%, 70.11%, 70.13% and 69.3%, respectively.
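As a quick consistency check on the Bengali figures, the sketch below computes the balanced F-Score as the harmonic mean of Recall and Precision. The abstract does not give the formula, so the standard definition is assumed here, and the helper name `f_score` is hypothetical.

```python
def f_score(recall: float, precision: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if recall + precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported overall average Recall and Precision for Bengali, in percent.
bengali_f = f_score(recall=88.01, precision=82.63)
print(f"F-Score from averaged Recall/Precision: {bengali_f:.2f}%")
```

The result lands within a few hundredths of a percentage point of the reported 85.22%. A small gap of that kind would be expected if the reported F-Score is an average of per-fold F-Scores rather than the F-Score of the averaged Recall and Precision, though the abstract does not say which was used.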

Journal: International Journal of Computer Processing of Languages
Publication Date: Sep 1, 2008
Citations: 19

Similar Papers

A Multiengine NER System with Context Pattern Learning and Post-processing Improves System Performance
Asif Ekbal ... Sivaji Bandyopadhyay
International Journal of Computer Processing of Languages | VOL. 22
01 Jun 2009

Named entity recognition in Bengali and Hindi using support vector machine
Asif Ekbal ... Sivaji Bandyopadhyay
Lingvisticæ Investigationes | VOL. 34
07 Jul 2011

Named Entity Recognition using Support Vector Machine: A Language Independent Approach
...
Zenodo (CERN European Organization for Nuclear Research) | VOL. -
23 Mar 2010

Named Entity Recognition in Bengali
Asif Ekbal ... Sivaji Bandyopadhyay
Northern European Journal of Language Technology | VOL. 1
02 Feb 2010
