Abstract

Part of Speech (POS) tagging is the process of assigning each word in a sentence its corresponding part of speech. It is a common pre-processing step in Natural Language Processing (NLP) applications such as speech recognition, machine translation and sentiment analysis. A few studies have addressed POS tagging for Tamil. However, the performance of POS taggers on unknown words (words that do not appear in the lexicon) has not been explored in the literature. Unknown words are a frequent problem in POS tagging because, in real-world use, an NLP application will encounter words that are not in its lexicon. This paper proposes a deep learning-based POS tagger for the Tamil language using Bi-directional Long Short-Term Memory (BLSTM). Our experiments use two corpora: the AU-KBC annotated corpus and the MeitY corpus. We also analyse the performance of the POS tagger on unknown words. Test results show that the proposed approach achieves accuracies of 99.8%, 99.5% and 96.5% when the test sentences contain only known words, around 9.8% unknown words and 47.6% unknown words, respectively.
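
The definition of an unknown word used here (a word that does not appear in the lexicon) can be illustrated with a minimal sketch. It assumes the lexicon is simply the set of word forms seen in the training sentences and that the unknown-word percentage is counted per token; the abstract specifies neither, so both are assumptions. The function names and the transliterated toy sentences are hypothetical and are not drawn from the AU-KBC or MeitY corpora.

```python
from typing import Iterable, List, Set


def build_lexicon(train_sentences: Iterable[List[str]]) -> Set[str]:
    """Collect every distinct word form seen in the training sentences."""
    return {word for sentence in train_sentences for word in sentence}


def unknown_word_rate(lexicon: Set[str], test_sentences: Iterable[List[str]]) -> float:
    """Return the fraction of test tokens that are absent from the lexicon."""
    total = 0
    unknown = 0
    for sentence in test_sentences:
        for word in sentence:
            total += 1
            if word not in lexicon:
                unknown += 1
    # Guard against an empty test set (assumed convention: rate of 0.0).
    return unknown / total if total else 0.0


# Hypothetical, transliterated toy data for illustration only.
train = [["avan", "vandhaan"], ["aval", "paadinaal"]]
test = [["avan", "paadinaan"], ["aval", "vandhaan"]]

lexicon = build_lexicon(train)
rate = unknown_word_rate(lexicon, test)
print(f"unknown-word rate: {rate:.1%}")
```

A test set can be characterised this way before tagging accuracy is measured, which is one plausible way to arrive at splits with different proportions of unknown words.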
