Automatic classification of Arabic dialects is a preliminary step toward many dialect-sensitive Arabic natural language processing tasks. Arabic dialect identification entails predicting the dialect associated with a given textual input and classifying it under its respective label. In this paper, we propose a novel approach that merges several distinct datasets to obtain a large, diverse, and bias-free dialectal corpus. Our dataset is a collection of parallel sentences translated into multiple dialects (MADAR), together with tweets gathered from Twitter users (NADI-2020, NADI-2021, QADI, and ARAP-Tweet 2.0). The collected dataset is classified into seven labels: Gulf, Levant, Iraq, Maghreb, Nile Basin, Yemen, and Modern Standard Arabic. The merged dataset was cleaned to produce Arabic sentences free of punctuation, non-Arabic characters, numbers, repeated characters, empty lines, and elongation characters. We obtained high dialectal classification accuracy using a new Word2vec embedding model trained on the merged dialectal dataset. Seven deep learning systems were trained on a balanced subset of the dataset: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BiLSTM), Bidirectional GRU (BiGRU), Convolutional LSTM (CLSTM), and Convolutional GRU (CGRU). Single-label classification tests run on these trained models achieved a minimum accuracy of 77.90% (CNN) and a maximum accuracy of 81.52% (BiLSTM). We further evaluated accuracy separately on short and long sentences, attaining 87.52% on short sentences with the CGRU and 94.06% on long sentences with the BiGRU, both of which indicate the efficacy of the proposed approach.
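The abstract describes the cleaning pipeline but does not list it; the snippet below is a minimal sketch of those steps in Python using regular expressions. The function name, the exact Unicode ranges, and the character-repetition threshold are illustrative assumptions, not the paper's implementation.

```python
import re

# Sketch of the cleaning steps described above: remove elongation (tatweel),
# punctuation, digits, non-Arabic characters, repeated characters, and
# empty/extra whitespace. Thresholds and ranges are assumed, not the paper's.
def clean_arabic(text: str) -> str:
    text = re.sub(r'\u0640+', '', text)               # drop tatweel/elongation (U+0640)
    text = re.sub(r'[^\u0621-\u064A\s]+', ' ', text)  # keep Arabic letters only
                                                      # (removes punctuation, digits,
                                                      #  Latin characters, emojis)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)        # squeeze runs of 3+ repeated chars
    return re.sub(r'\s+', ' ', text).strip()          # collapse whitespace/empty lines

# e.g. clean_arabic("واااااو!!! vote4me ١٢٣") -> "وااو"
```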
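Similarly, the embedding and classification stage could look like the sketch below, assuming gensim for the Word2vec model and Keras for the BiLSTM (one of the seven reported architectures, shown as a representative). All hyperparameters (vector size, window, sequence length, hidden units, dropout) and the `raw_lines` input are placeholders, not values reported in the paper.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

# Hypothetical input: raw_lines is the merged dialectal corpus, one sentence per line.
sentences = [clean_arabic(line).split() for line in raw_lines]

# Train a Word2vec embedding on the merged corpus
# (vector_size/window/min_count are assumed values).
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)

# Build an embedding matrix; index 0 is reserved for padding.
vocab = w2v.wv.index_to_key
word_index = {w: i + 1 for i, w in enumerate(vocab)}
emb_matrix = np.zeros((len(vocab) + 1, 300))
for w, i in word_index.items():
    emb_matrix[i] = w2v.wv[w]

MAX_LEN = 50        # assumed maximum sequence length
NUM_CLASSES = 7     # Gulf, Levant, Iraq, Maghreb, Nile Basin, Yemen, MSA

# Representative BiLSTM classifier over the pretrained, frozen embeddings.
model = models.Sequential([
    layers.Embedding(len(vocab) + 1, 300, weights=[emb_matrix],
                     input_length=MAX_LEN, trainable=False),
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels 0..6
              metrics=['accuracy'])
```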