Abstract

Abstract With the rapid growth of online content written in the Kurdish language, there is an increasing need to make it machine-readable and processable. Part of speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect. The corpus, containing 74,258 words and thirty-eight tags, employs a hybrid approach utilizing the bigram hidden Markov model in combination with the Kurdish rule-based approach to POS tagging. This approach addresses two key problems that arise with rule-based approaches, namely misclassified words and ambiguity-related unanalyzed words. The proposed approach’s accuracy was assessed by training and testing it on the DASTAN corpus, yielding a 96% accuracy rate. Overall, this study’s findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call