Arabic dialect identification in social media: A hybrid model with transformer models and BiLSTM

Amjad A Alsuwaylimi

doi:10.1016/j.heliyon.2024.e36280

Abstract

Arabic Dialect Identification (ADI) is a challenging task in natural language processing because Arabic dialects are diverse and vary by region. Despite previous efforts, the task remains difficult. This study therefore applies transformer models to ADI on social media. Two hybrid models are proposed: one combines Bidirectional Long Short-Term Memory (BiLSTM) with CAMeLBERT, and the other combines BiLSTM with ALBERT. In addition, a novel dataset is introduced, comprising 121,289 user-generated comments from various social media platforms and covering four major Arabic dialects (Egyptian, Jordanian, Gulf and Yemeni). Several experiments were conducted using conventional Machine Learning Classifiers (MLCs) and Deep Learning Models (DLMs) as baselines to measure the effectiveness of the proposed models. In addition, binary classification is performed between pairs of dialects to determine which are closest to each other. Performance is measured using common metrics such as accuracy, precision, recall and F-score. Experimental results demonstrate the superior performance of the proposed hybrid models in ADI: CAMeLBERT with BiLSTM and ALBERT with BiLSTM achieved accuracies of 87.67% and 86.51%, respectively.
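As a rough illustration of the evaluation metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall and F-score for a four-dialect classifier. It is not the study's evaluation code: the function name `classification_metrics`, the use of macro averaging across dialects, and the convention of scoring an undefined ratio as 0.0 are all assumptions made for illustration.

```python
from collections import Counter

# The four dialect labels named in the abstract.
DIALECTS = ["Egyptian", "Jordanian", "Gulf", "Yemeni"]


def classification_metrics(y_true, y_pred, labels=DIALECTS):
    """Return accuracy plus macro-averaged precision, recall and F-score.

    Hypothetical helper: macro averaging and the 0.0 fallback for empty
    denominators are assumptions, not details taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this dialect
        else:
            fp[pred] += 1       # predicted dialect was wrong
            fn[truth] += 1      # true dialect was missed

    per_class = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f_score": f_score}

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(m["precision"] for m in per_class.values()) / n,
        "macro_recall": sum(m["recall"] for m in per_class.values()) / n,
        "macro_f_score": sum(m["f_score"] for m in per_class.values()) / n,
        "per_class": per_class,
    }


if __name__ == "__main__":
    # Toy labels for demonstration only; these are not data from the study.
    gold = ["Egyptian", "Egyptian", "Gulf", "Gulf", "Yemeni", "Jordanian"]
    predicted = ["Egyptian", "Gulf", "Gulf", "Gulf", "Yemeni", "Yemeni"]
    print(classification_metrics(gold, predicted))
```

The per-class entries are useful alongside the averages because confusion between two closely related dialects shows up as low recall for one dialect and low precision for the other.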

Full Text