Applying a Character-Level Model to a Short Arabic Dialect Sentence: A Saudi Dialect as a Case Study

Tahani Alqurashi

doi:10.3390/app122312435

Open Access

https://doi.org/10.3390/app122312435

Journal: Applied Sciences
Publication Date: Dec 5, 2022
Citations: 4
License type: CC BY 4.0

Affiliation: Umm al-Qura University

Abstract

Arabic dialect identification (ADI) has recently drawn considerable interest among researchers in the fields of language recognition and natural language processing. This study investigated the use of a character-level model, which is effectively unrestricted in its vocabulary, to identify fine-grained Arabic dialects in short written texts. The study considered the four main Saudi dialects spoken across the country. The proposed ADI approach consists of five main phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, character-based modelling (deep learning and classical machine learning), and model performance evaluation. Several classical machine learning methods were applied to the dataset, including logistic regression, stochastic gradient descent, variations of the naive Bayes model, and support vector classification. For deep learning, a character-level convolutional neural network (CNN) model was adapted with a bidirectional long short-term memory approach. The collected data were tested under several classification tasks, namely two-, three- and four-way ADI tasks. The results revealed that the classical machine learning algorithms outperformed the CNN approach. Moreover, term frequency–inverse document frequency (TF-IDF) weighting, combined with a character n-gram model ranging from unigrams to four-grams, achieved the best performance among the tested parameters.
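To make the best-performing feature setting concrete, the following is a minimal sketch of character n-gram extraction (unigrams to four-grams) with TF-IDF weighting, written in plain Python. It is an illustration and not the study's implementation: the function names `char_ngrams` and `tfidf_vectors`, the smoothed idf formula, and the L2 normalisation are all assumptions, since the abstract does not state these details.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=4):
    """Map each document to a sparse dict {ngram: TF-IDF weight}.

    Assumption: smoothed idf = ln((1 + N) / (1 + df)) + 1, followed by
    L2 normalisation. This is one common variant, not necessarily the
    one used in the study.
    """
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
            for g, tf in c.items()
        }
        # Scale to unit length; an empty document keeps an empty vector.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors
```

Because the features are built from characters rather than words, any string can be vectorised, including short sentences with spellings never seen in training. The resulting sparse vectors could then be passed to a linear classifier such as logistic regression or a support vector classifier.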

Full Text