Lip-reading classification has attracted considerable interest in recent decades because of its wide range of applications. It plays an important role in interpreting spoken words in noisy environments and in restoring communication for people with hearing impairments. Despite significant advances in the field, existing work still suffers from several drawbacks, notably in feature extraction and model capacity for visual speech recognition. For these reasons, this paper proposes an Optimized Quaternion Meixner Moments Convolutional Neural Network (OQMMs-CNN) method that aims to build a Visual Speech Recognition (VSR) system based only on video images. The method combines OQMMs, optimized with the Grey Wolf Optimizer (GWO), and convolutional neural networks drawn from deep learning, with the aim of recognizing the digits, words, or letters shown in input videos. The OQMMs serve as descriptors that identify, retain, and extract the essential information from the video frames (lip images) and generate the moments fed to the CNN. They are built on Meixner polynomials, which are defined by the local parameters α and β; the GWO is then applied to ensure high classification accuracy by optimizing these local parameters. Tested on public datasets (AVLetters, Grid, AVDigits, and LRW) and compared with several approaches that rely on complex models and deep architectures, the method proves to be an effective solution for reducing the high dimensionality of video images and the training time.
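As a rough illustration of the pipeline the abstract describes, the sketch below computes Meixner-moment descriptors of a lip region and tunes the two Meixner polynomial parameters with a minimal grey-wolf-style search. It is not the authors' code: the parameter names (beta, c, taken here to correspond to the paper's α and β), the ROI size, the moment order, the search bounds, and the placeholder fitness are all assumptions, the quaternion (colour) formulation is simplified to a single grayscale channel, and the downstream CNN classifier is omitted.

```python
# Illustrative sketch only (not the authors' released code): Meixner-moment lip
# descriptors plus a minimal grey-wolf-style search over the two polynomial
# parameters. In the paper the optimized moments would then feed a CNN, and the
# fitness would be the classifier's validation accuracy (both omitted here).
import numpy as np
from scipy.special import gammaln

def meixner_basis(order, length, beta, c):
    """Meixner polynomials M_n(x; beta, c), n = 0..order-1, on x = 0..length-1,
    built with the classical three-term recurrence and the discrete weight."""
    x = np.arange(length, dtype=float)
    M = np.zeros((order, length))
    M[0] = 1.0
    if order > 1:
        M[1] = 1.0 + (c - 1.0) * x / (beta * c)
    for n in range(1, order - 1):
        M[n + 1] = (((c - 1.0) * x + n + (n + beta) * c) * M[n]
                    - n * M[n - 1]) / (c * (n + beta))
    # Meixner weight w(x) = (beta)_x c^x / x!, applied as sqrt(w) for stability
    logw = gammaln(beta + x) - gammaln(beta) + x * np.log(c) - gammaln(x + 1.0)
    return M * np.exp(0.5 * logw)

def meixner_moments(image, order, beta, c):
    """2-D Meixner moments of a grayscale lip ROI: projection of the image
    onto the weighted polynomial basis along rows and columns."""
    h, w = image.shape
    Mr = meixner_basis(order, h, beta, c)
    Mc = meixner_basis(order, w, beta, c)
    return Mr @ image @ Mc.T                    # (order x order) descriptor

def gwo_search(fitness, bounds, n_wolves=8, n_iters=30, seed=0):
    """Minimal grey wolf optimizer: keep the three best wolves and pull the
    pack toward them with a linearly decaying coefficient a (2 -> 0)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_wolves, len(lo)))
    for t in range(n_iters):
        scores = np.array([fitness(x) for x in X])
        leaders = X[np.argsort(scores)[:3]]     # alpha, beta, delta wolves
        a = 2.0 - 2.0 * t / n_iters
        for i in range(n_wolves):
            cand = []
            for leader in leaders:
                r1, r2 = rng.random(len(lo)), rng.random(len(lo))
                A, C = 2 * a * r1 - a, 2 * r2
                cand.append(leader - A * np.abs(C * leader - X[i]))
            X[i] = np.clip(np.mean(cand, axis=0), lo, hi)
    scores = np.array([fitness(x) for x in X])
    return X[np.argmin(scores)]

if __name__ == "__main__":
    roi = np.random.default_rng(1).random((32, 32))   # stand-in for a lip ROI
    def fitness(params):                              # placeholder objective
        beta, c = params
        return -np.linalg.norm(meixner_moments(roi, 16, beta, c))
    best = gwo_search(fitness, bounds=np.array([[0.5, 5.0], [0.05, 0.95]]))
    print("selected (beta, c):", best)
```

In this sketch the moment matrix plays the role of the low-dimensional input that the abstract says replaces raw video frames at the CNN input, which is what keeps the network small and the training time short.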