Abstract

This paper reports on the development of a Named Entity Recognition (NER) system for Indian languages, specifically Bengali, Hindi, Telugu, Oriya and Urdu, using the statistical Maximum Entropy (ME) framework. We have used the annotated corpora obtained from the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL), which are tagged with twelve NE tags. A tag conversion routine has been developed to convert these corpora into a form tagged with four NE tags, namely Person name, Location name, Organization name and Miscellaneous name. The system uses contextual information about the words along with a variety of orthographic word-level features that are helpful in predicting the four NE classes. In this work, we have considered language independent as well as language specific features. Language independent features include the contextual words, prefixes and suffixes of all the words in the training corpus, several digit features depending upon the presence and/or the number of digits in a token, whether the token is the first word of the sentence, and frequency features of the words. The system considers linguistic features for Bengali and Hindi only. Linguistic features for Bengali include the set of known suffixes that may appear with NEs, clue words that help in predicting location and organization names, words that help to recognize measurement expressions, designation words that help to identify person names, and various gazetteer lists such as first names, middle names, last names, location names, organization names, function words, month names, weekdays, etc. As linguistic features for Hindi, the system uses only the lists of first names, middle names, last names, function words, month names and weekdays, along with the list of words that help to recognize measurements. 
In addition to the other features, part of speech (POS) information of the words has also been considered for Bengali and Hindi. No linguistic features have been considered for Telugu, Oriya and Urdu. The evaluation results show that the use of linguistic features improves the performance of the system. The system has been trained with 122,467 Bengali, 502,974 Hindi, 64,026 Telugu, 93,173 Oriya and 35,447 Urdu tokens. In the 10-fold cross validation test, the system has demonstrated its highest overall average Recall, Precision, and F-Score values of 88.01%, 82.63%, and 85.22%, respectively, for Bengali. The 10-fold cross validation tests on the Hindi, Telugu, Oriya, and Urdu data have shown overall average F-Score values of 82.66%, 70.11%, 70.13%, and 69.3%, respectively.
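
The language independent features listed above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `token_features`, the context window of two words, the maximum affix length of three characters, the `<PAD>` marker and the particular digit features are all assumptions chosen for illustration. Frequency features and the language specific gazetteer lookups are omitted.

```python
def token_features(tokens, i, window=2, max_affix=3):
    """Return a feature dictionary for tokens[i] (illustrative sketch only).

    `window` and `max_affix` are assumed values, not taken from the paper.
    """
    word = tokens[i]
    feats = {
        "word": word,
        # First-word-of-sentence feature.
        "first_word": i == 0,
    }

    # Contextual words: neighbours within the window, padded at the edges.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"w[{off:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    # Prefixes and suffixes up to max_affix characters,
    # skipped when the word is not longer than the affix.
    for n in range(1, max_affix + 1):
        if len(word) > n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]

    # Digit features based on the presence and number of digits.
    # str.isdigit() also accepts Unicode digits such as Bengali or Devanagari ones.
    n_digits = sum(ch.isdigit() for ch in word)
    all_digits = len(word) > 0 and n_digits == len(word)
    feats["has_digit"] = n_digits > 0
    feats["all_digits"] = all_digits
    feats["two_digits"] = all_digits and n_digits == 2
    feats["four_digits"] = all_digits and n_digits == 4
    return feats


if __name__ == "__main__":
    sentence = ["Kolkata", "hosted", "the", "2008", "workshop"]
    for name, value in token_features(sentence, 3).items():
        print(name, value)
```

A dictionary of this kind could be passed, one per token, to any maximum entropy (multinomial logistic regression) trainer together with the converted four-class NE tags.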
