Deep Learning based UPoS Tagger for Assamese Religious Text

Kuwali Talukdar, Farha Naznin, Ratul Deka, Shikhar Kumar Sarma

doi:10.61707/nn1dfz44

Abstract

Religious texts are known to follow specific patterns and styles of writing and to use specialized vocabularies. As a result, they carry distinctive semantic and syntactic characteristics that demand special attention in Natural Language Processing (NLP) tasks. This paper applies Deep Learning (DL) techniques to Parts of Speech (PoS) tagging of religious texts in the Assamese language. We created a specialized dataset of approximately 11,000 sentences drawn from various sources, including web crawling and filtering of religious texts from existing corpora. The dataset was first tagged using a pre-existing DL-based Assamese Universal Parts of Speech (UPoS) tagger, and a performance matrix was constructed to analyze the quality of this initial tagging. Linguists then manually validated the tagged data, identifying errors and making the required corrections. A subset of the dataset was used for manual evaluation, and the validated dataset was treated as a gold-standard training set for further DL models based on GRU, RNN and Bidirectional LSTM (BiLSTM) architectures. Accuracy, precision, and recall were recorded for all three models, and F1 scores were calculated. Training and testing accuracies are compared in performance graphs, and the recorded results demonstrate the effectiveness of the proposed approach.
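As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how token-level accuracy and macro-averaged precision, recall, and F1 could be computed for a tagger's output. The function name `tagging_metrics`, the macro-averaging choice, and the toy tag sequences are assumptions made for this example; they are not taken from the paper, which does not state how its scores were averaged.

```python
from collections import Counter


def tagging_metrics(gold, pred):
    """Return (accuracy, (macro_precision, macro_recall, macro_f1), per_tag).

    gold and pred are flat lists of tags, one entry per token.
    This helper is hypothetical and only illustrates the metrics.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for tag g
        else:
            fp[p] += 1      # tag p was predicted but is wrong
            fn[g] += 1      # tag g was missed

    per_tag = {}
    for tag in sorted(set(gold) | set(pred)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        per_tag[tag] = (precision, recall, f1)

    n_tags = len(per_tag)
    # Macro average: every tag contributes equally, regardless of frequency.
    macro = tuple(sum(v[i] for v in per_tag.values()) / n_tags for i in range(3))
    accuracy = sum(tp.values()) / len(gold)
    return accuracy, macro, per_tag


# Toy example with UPoS-style tags (invented data, not from the dataset).
gold_tags = ["NOUN", "VERB", "NOUN", "ADJ"]
pred_tags = ["NOUN", "NOUN", "NOUN", "ADJ"]
accuracy, macro, per_tag = tagging_metrics(gold_tags, pred_tags)
print(accuracy, macro)
```

Macro averaging is used here because it weights rare tags the same as frequent ones; a frequency-weighted average would be an equally reasonable choice for a different reporting goal.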

Full Text