Abstract

Part-of-speech (POS) tagging is the process of assigning each word in a sentence its corresponding part of speech. It is a common pre-processing step in Natural Language Processing (NLP) applications such as speech recognition, machine translation and sentiment analysis. A few studies have addressed POS tagging for Tamil. However, the performance of POS taggers on unknown words (words that do not appear in the lexicon) has not been explored in the literature. Unknown words are a frequent problem in POS tagging because, in real-world use, an NLP application will encounter words that are not in its lexicon. This paper proposes a deep learning-based POS tagger for the Tamil language using Bi-directional Long Short-Term Memory (BLSTM). Our experiments use two corpora: the AU-KBC annotated corpus and the MeitY corpus. We also analyse the performance of the POS tagger on unknown words. Test results show that the proposed tagger achieves accuracies of 99.8%, 99.5% and 96.5% on test sentences containing only known words, around 9.8% unknown words and 47.6% unknown words, respectively.

