Abstract

Lip-reading technologies are progressing rapidly following the breakthrough of deep learning. They play a vital role in many applications, such as human-machine communication and security. In this paper, we propose an effective lip-reading recognition model for Arabic visual speech recognition based on deep learning algorithms. The Arabic visual datasets that we collected contain 2400 recordings of Arabic digits and 960 recordings of Arabic phrases from 24 native speakers. The primary purpose is to provide a high-performance model by enhancing the preprocessing phase. Firstly, we extract keyframes from our dataset. Secondly, we produce Concatenated Frame Images (CFIs) that represent the utterance sequence in a single image. Finally, VGG-19 is employed for visual feature extraction in the proposed model. We examined different numbers of keyframes (10, 15, and 20) while comparing two approaches: (1) the VGG-19 base model and (2) the VGG-19 base model with batch normalization. The results show that the second approach achieves greater accuracy on the test dataset: 94% for digit recognition, 97% for phrase recognition, and 93% for combined digit and phrase recognition. Therefore, our proposed model outperforms other models based on CFI input.
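
As a rough illustration of the pipeline summarized above, the sketch below tiles keyframes into one Concatenated Frame Image (CFI) and stacks a small classification head on a frozen VGG-19 base, with batch normalization as a toggle for comparing the two approaches. This is a minimal Keras sketch under stated assumptions: the 4x5 grid (matching the 20-keyframe setting), the 256-unit head, and the class count are illustrative choices, not the paper's published configuration.

    import numpy as np
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG19

    def build_cfi(keyframes, grid=(4, 5)):
        """Tile equally sized keyframes into one Concatenated Frame Image (CFI).
        `keyframes` holds rows*cols HxWx3 arrays (e.g. 20 mouth-region crops);
        the 4x5 grid is an illustrative assumption, not the paper's layout."""
        rows, cols = grid
        assert len(keyframes) == rows * cols, "keyframe count must fill the grid"
        strips = [np.concatenate(keyframes[r * cols:(r + 1) * cols], axis=1)
                  for r in range(rows)]
        return np.concatenate(strips, axis=0)

    def build_model(input_shape=(224, 224, 3), num_classes=10, batch_norm=True):
        """Frozen VGG-19 base as a visual feature extractor plus a small
        classification head; `batch_norm` toggles the variant that the
        abstract reports as more accurate."""
        base = VGG19(weights="imagenet", include_top=False, input_shape=input_shape)
        base.trainable = False  # use VGG-19 purely for feature extraction
        x = layers.Flatten()(base.output)
        x = layers.Dense(256, activation="relu")(x)  # head width is assumed
        if batch_norm:
            x = layers.BatchNormalization()(x)
        outputs = layers.Dense(num_classes, activation="softmax")(x)
        model = models.Model(base.input, outputs)
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

For the ten Arabic digits, build_model(num_classes=10, batch_norm=True) corresponds to the second, better-performing approach; a CFI assembled from 20 keyframes would be resized to the 224x224 network input before training.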

Highlights

  • The lip-reading recognition system, also known as Visual Speech Recognition (VSR), plays an essential role in human language communication and visual knowledge

  • Elrefaei et al. [11] propose using the Discrete Cosine Transform (DCT) technique to extract visual features; applying a Support Vector Machine (SVM) for classification, their results show an average Word Recognition Rate (WRR) of 70.09% (a sketch of this style of pipeline follows this list)

  • The proposed Arabic visual speech recognition model is capable of classifying digits and phrases in the Arabic language, as evaluated on our collected visual datasets
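
As a rough sketch of the DCT-plus-SVM baseline cited in the second highlight, the following illustrates that style of pipeline on dummy data: a 2-D DCT of a mouth region supplies low-frequency coefficients as features, and an SVM classifies them. The coefficient block size, the RBF kernel, and the data shapes are assumptions for illustration and are not taken from [11].

    import numpy as np
    from scipy.fftpack import dct
    from sklearn.svm import SVC

    def dct_features(mouth_roi, keep=16):
        """2-D DCT of a grayscale mouth region; keeping the top-left
        `keep` x `keep` low-frequency block is an illustrative choice."""
        coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm="ortho"),
                     axis=1, norm="ortho")
        return coeffs[:keep, :keep].ravel()

    # Dummy stand-in data: 40 random 64x64 "mouth crops" with digit labels.
    rng = np.random.default_rng(0)
    rois = rng.integers(0, 256, size=(40, 64, 64))
    labels = rng.integers(0, 10, size=40)

    X = np.stack([dct_features(r) for r in rois])
    clf = SVC(kernel="rbf").fit(X[:30], labels[:30])  # kernel choice assumed
    print("held-out accuracy:", clf.score(X[30:], labels[30:]))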

Introduction

The lip-reading recognition system, also known as Visual Speech Recognition (VSR), plays an essential role in human language communication and visual knowledge. It refers to the ability to learn or recognize visual speech without needing to hear the audio, working only with visual data (such as movements of the lips and face). Lip-reading technology is an appealing area of study for researchers because, by recognizing visual information without audio, it introduces a new tool for visual speech recognition in situations in which audio is unavailable or must be kept secure. Recognizing spoken words from the speaker's lip movement is called visual lip-reading, and it is an effective form of communication in many situations; for example, it can serve as a useful hearing aid [1].
