Abstract

Dialectal Arabic is the variety of Arabic used for daily communication across the Arab world, and each Arab country has its own dialect. Recognizing spoken Arabic dialects is challenging due to the significant variation among dialects and their lack of a standard structure, so few research attempts in the literature address this problem. End-to-end deep learning offers a promising way to improve the performance of speech recognition systems; however, overfitting remains the main problem for deep learning techniques when training data are scarce. In this paper, we investigate an end-to-end model for improving dialectal Arabic speech recognition (DASR) based on deep learning. Data augmentation is a key step in the proposed approach and is applied to the employed datasets to increase the amount of training data. The proposed approach is based on a hybrid model composed of a convolutional neural network (CNN) and long short-term memory (LSTM), referred to as CNN-LSTM. This model is used together with attention-based encoder-decoder methods to build the acoustic model and perform decoding. To the best of our knowledge, no prior research has employed CNN-LSTM and attention-based models in dialectal Arabic ASR systems. In addition, the language model is built using recurrent neural network (RNN) and LSTM methods. The proposed approach is validated on two speech datasets, SASSC and MGB-3. Experimental results show that the proposed approach achieves a word error rate (WER) of 57.02%, which is superior to the results reported by other approaches in the literature. A lower WER indicates a more accurate ASR system.
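The abstract reports results in terms of word error rate (WER), the standard ASR evaluation metric: the word-level edit distance (substitutions, deletions, and insertions) between the recognized hypothesis and the reference transcript, divided by the number of reference words. A minimal illustrative sketch (not the paper's evaluation code; the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two words ("on the") are missing from the hypothesis: 2 errors / 6 words.
print(round(wer("the cat sat on the mat", "the cat sat mat") * 100, 2))  # 33.33
```

A WER of 57.02% thus means that, on average, roughly 57 word-level edits are needed per 100 reference words to recover the correct transcript.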
