Arabic Speech Recognition Based on Encoder-Decoder Architecture of Transformer

Mohanad Sameer, Alla Hussein, Ahmed Talib

doi:10.51173/jt.v5i1.749

Open Access

Journal: Journal of Techniques
Publication Date: Mar 21, 2023
Citations: 6
License type: CC BY 4.0

Affiliation: Middle Technical University

Abstract

Recognizing and transcribing human speech has become an increasingly important task. Recently, researchers have shown growing interest in automatic speech recognition (ASR) using end-to-end models. Previous architectures for Arabic ASR have included time-delay neural networks, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. Previous end-to-end approaches have suffered from slow training and inference because of limited training parallelization, and they require a large amount of data to achieve acceptable results in recognizing Arabic speech. This research presents an Arabic speech recognition system based on a transformer encoder-decoder architecture with self-attention that transcribes Arabic speech segments into text and can be trained faster and more efficiently. The proposed model outperforms previous end-to-end approaches on Mozilla's Common Voice dataset. We introduce a speech-transformer model trained for 110 epochs on only 112 hours of speech. Although Arabic is considered one of the more difficult languages for speech recognition systems, the model achieved a best word error rate (WER) of 3.2, compared with other systems whose training requires a very large amount of data. The proposed system was evaluated on the Common Voice 8.0 dataset without a language model.
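
Since the abstract reports its headline result as a word error rate, a minimal sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring script and any Arabic text normalization it applies are not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Illustrative sketch only. Tokenization is plain whitespace splitting,
    which is an assumption and not necessarily the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

The function returns a fraction, so multiplying by 100 gives a percentage. Corpus-level WER is usually obtained by summing edit distances and reference lengths over all utterances before dividing, not by averaging per-utterance values.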

Full Text