Parts-of-speech tagger for Sindhi language using deep neural network architecture

Adnan Ali Memon, Saman Hina, Abdul Karim Kazi, Saad Ahmed

doi:10.22581/muet1982.2768

Abstract

Language is a fundamental medium for human communication, encompassing spoken and written forms, each governed by grammatical rules. Sindhi, one of the oldest languages, is characterized by rich morphology and grammatical structure. Part-of-speech (POS) tagging, a crucial process in natural language processing, assigns a grammatical tag to each word. This research presents a deep learning approach to POS tagging for Sindhi text. We developed a POS tagger using Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models, of which the LSTM proved more effective. This study represents the first application of these deep learning methods to POS tagging in Sindhi. Using fastText, we trained 79,959 Sindhi word vectors, derived from a corpus compiled from diverse sources including Sindhi books, stories, and poetry. The corpus comprises 1,459 sentences and 10,584 unique words, split into 80% for training and 20% for validation. The LSTM model achieved an accuracy of 85.80%, outperforming the GRU model (80.77%) by about 5 percentage points. The novelty of this work lies in applying deep learning techniques to improve POS tagging accuracy on a Sindhi language corpus.
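
The 80/20 split and the reported accuracy gap can be illustrated with a minimal sketch. It assumes the split is made at the sentence level after shuffling and that accuracy is counted per token; the abstract states neither, and the function names, the seed, and the placeholder corpus are hypothetical.

```python
import random


def split_sentences(sentences, train_fraction=0.8, seed=0):
    """Shuffle a copy of the sentences and split it into train/validation parts.

    Assumption: sentence-level split with a fixed seed; the paper does not
    describe how its 80/20 split was drawn.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pairs = [
        (gold, pred)
        for gold_sentence, pred_sentence in zip(gold_tags, predicted_tags)
        for gold, pred in zip(gold_sentence, pred_sentence)
    ]
    correct = sum(1 for gold, pred in pairs if gold == pred)
    return correct / len(pairs)


# Placeholder corpus with the sentence count reported in the abstract.
corpus = [f"sentence_{i}" for i in range(1459)]
train, valid = split_sentences(corpus)

# Gap between the two reported accuracies, in percentage points.
margin = round(85.80 - 80.77, 2)

print(len(train), len(valid), margin)
```

Under these assumptions the split yields 1,167 training and 292 validation sentences, and the gap between the two models is 5.03 percentage points.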

Full Text