Intelligent Part of Speech tagger for Hindi

Devashish Dutta, Subhanu Halder, Tirthankar Gayen

doi:10.1016/j.procs.2023.01.042

Abstract

English parts of speech such as noun, verb, adverb, adjective, pronoun, preposition, interjection, and conjunction have rough counterparts in Hindi, but the correspondence is not exact. Hindi grammar distinguishes parts of speech (POS) based on morphological features and on the occurrence of a word/lexeme in a sentence. The existing techniques used for POS tagging in English may not work properly for an Indian language like Hindi, because the grammatical structure of a relatively free word order language such as Hindi differs from that of English. Stochastic taggers may not perform well because they do not take morphological information into account. The available Hindi word corpora usually have low frequencies for individual tags, so a larger corpus with more diverse sentence types can provide better results. However, even after applying smoothing techniques, most of these taggers fail to provide correct results in the presence of unknown words. Considering these aspects, this paper proposes an Intelligent POS tagger for the Hindi language based on the Viterbi algorithm and K-Nearest Neighbour, capable of providing more accurate results than Viterbi alone in the presence of unknown words.
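To make the idea of combining Viterbi decoding with a nearest-neighbour guess for unknown words concrete, the following is a minimal sketch only. The abstract does not say how the two components are combined, so everything here is an assumption made for illustration: the toy romanized training sentences, the tag set, the add-one smoothing, the shared-suffix similarity used by the hypothetical `knn_tag` helper, and the way the KNN guess overrides the emission score for out-of-vocabulary words.

```python
from collections import Counter, defaultdict
import math

# Toy, hand-made training data (romanized Hindi); NOT taken from the paper.
TRAIN = [
    [("raam", "NOUN"), ("ghar", "NOUN"), ("jaata", "VERB"), ("hai", "AUX")],
    [("siita", "NOUN"), ("khaana", "NOUN"), ("khaati", "VERB"), ("hai", "AUX")],
    [("ladka", "NOUN"), ("kitaab", "NOUN"), ("padhta", "VERB"), ("hai", "AUX")],
]

# Count tag-to-tag transitions and tag-to-word emissions.
trans = defaultdict(Counter)
emit = defaultdict(Counter)
for sent in TRAIN:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

TAGS = sorted(emit)
VOCAB = {w for counts in emit.values() for w in counts}

def log_trans(prev, tag):
    # Add-one smoothed transition log-probability.
    c = trans[prev]
    return math.log((c[tag] + 1) / (sum(c.values()) + len(TAGS)))

def log_emit(tag, word):
    # Add-one smoothed emission log-probability (one extra slot for unknowns).
    c = emit[tag]
    return math.log((c[word] + 1) / (sum(c.values()) + len(VOCAB) + 1))

def shared_suffix(a, b):
    # Assumed similarity: number of trailing characters two words share.
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def knn_tag(word, k=3):
    # Hypothetical helper: majority tag among the k most similar known words.
    known = [(shared_suffix(word, w), w, t) for t, c in emit.items() for w in c]
    known.sort(key=lambda x: (-x[0], x[1]))
    votes = Counter(t for _, _, t in known[:k])
    return votes.most_common(1)[0][0]

def viterbi(words, use_knn=False):
    def score(tag, word):
        # Assumption: for unknown words, trust the KNN guess over smoothing.
        if use_knn and word not in VOCAB:
            return 0.0 if tag == knn_tag(word) else math.log(1e-6)
        return log_emit(tag, word)

    # best[tag] = (log-probability, tag path) of the best path ending in tag.
    best = {t: (log_trans("<s>", t) + score(t, words[0]), [t]) for t in TAGS}
    for word in words[1:]:
        new = {}
        for t in TAGS:
            p, path = max(
                ((best[q][0] + log_trans(q, t), best[q][1]) for q in TAGS),
                key=lambda x: x[0],
            )
            new[t] = (p + score(t, word), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi("raam ghar aata hai".split(), use_knn=True))
```

In this sketch the word `aata` is absent from the toy training data, so with `use_knn=True` its tag is proposed by its nearest known neighbours and the Viterbi search then fits that proposal into the best overall tag sequence. A real system would need a far larger corpus and a similarity measure grounded in Hindi morphology.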

Full Text