Abstract

As an alternative approach, viseme-based lipreading systems have demonstrated promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems is strongly affected by how efficiently visemes are converted to words during the lipreading process. As shown in the literature, this issue has become a bottleneck: a system's performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a comparatively low word classification accuracy (e.g., just over 60%). The underlying cause of this phenomenon is that roughly half of the words in the English language are homophemes, i.e., a single viseme sequence can map to multiple words, e.g., “time” and “some”. In this paper, aiming to tackle this issue, a deep learning network model with an Attention-based Gated Recurrent Unit (GRU) is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time, and it has been verified through analysis and practical experiments predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words and discriminating between homopheme words, and that is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and therefore requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset is achieved, with a word accuracy rate of 79.6%, an improvement of 15.0% over state-of-the-art approaches.
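
To make the conversion step concrete, the sketch below shows a minimal attention-based GRU encoder–decoder that maps a sequence of viseme classes to a sequence of words, in the spirit of the model described above. It is an illustrative assumption rather than the authors' exact architecture: the vocabulary sizes, hidden dimension, attention configuration, and the choice of PyTorch are all choices made for this example.

```python
# Minimal sketch (not the paper's exact model) of an attention-based GRU
# encoder-decoder mapping viseme class indices to word logits.
# Vocabulary sizes, hidden size, and head count are illustrative assumptions.
import torch
import torch.nn as nn

class VisemeToWordSeq2Seq(nn.Module):
    def __init__(self, n_visemes=16, n_words=10_000, hidden=256):
        super().__init__()
        self.viseme_emb = nn.Embedding(n_visemes, hidden)
        self.word_emb = nn.Embedding(n_words, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_words)

    def forward(self, visemes, words_in):
        # visemes: (B, Tv) viseme indices; words_in: (B, Tw) previous word indices
        enc_out, enc_h = self.encoder(self.viseme_emb(visemes))
        dec_out, _ = self.decoder(self.word_emb(words_in), enc_h)
        # Each decoder step attends over the encoded viseme sequence, so a word
        # prediction can use context beyond the homopheme's own visemes.
        ctx, _ = self.attn(dec_out, enc_out, enc_out)
        return self.out(torch.cat([dec_out, ctx], dim=-1))  # (B, Tw, n_words)

# Usage: a batch of 2 sequences with 12 input visemes and 6 target word steps.
model = VisemeToWordSeq2Seq()
logits = model(torch.randint(0, 16, (2, 12)), torch.randint(0, 10_000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 10000])
```

The attention layer gives each decoded word access to the full viseme context, so homopheme candidates can be resolved using the rest of the sentence rather than the ambiguous viseme sequence alone.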

Highlights

  • The automation of lipreading has attracted a significant amount of research attention in the last several years

  • For the training and evaluation of the viseme-to-word converters mentioned in Section 3, excluding the Generative Pre-Training (GPT)-based converter, the LRS2 sentence data described in Section 3.1 have been used, with 80% of all sentences used for training (37,666 samples) and 20% used for testing (9416 samples)

  • k-Fold cross-validation has been used with a fold value of k = 5, and for each fold, a different set of 9416 samples was used for testing (see the sketch after this list)
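
As a concrete illustration of this split, the short sketch below sets up five folds over the 47,082 LRS2 sentences (37,666 + 9416) so that each fold holds out a different ~20% for testing. The use of scikit-learn, the shuffle, and the random seed are assumptions for this example; only the fold arithmetic follows the description above.

```python
# Illustrative sketch of the 5-fold, 80/20 split described in the highlights.
import numpy as np
from sklearn.model_selection import KFold

sentence_ids = np.arange(37_666 + 9_416)  # 47,082 LRS2 sentences in total
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(sentence_ids)):
    # Each fold trains on ~80% (about 37,666 sentences) and holds out a
    # different ~20% (about 9416 sentences) for testing.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```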



Introduction

The automation of lipreading has attracted a significant amount of research attention in the last several years. A variety of approaches have been utilised for classification, with deep learning-based approaches being popular for lipreading individuals uttering words and sentences. Lip movements can be decoded using a variety of forms, including visemes, phonemes, characters, and words, and each of these forms provides a different classification schema for designing automated lipreading systems. Such systems vary in their capabilities, ranging from recognising isolated speech segments in the form of individual words or characters to decoding entire sentences covering a wide range of vocabulary. The lexicons can consist of a vocabulary with thousands of different possible words.
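
The difference between these decoding units can be illustrated with a small example. The phoneme transcriptions and the phoneme-to-viseme grouping below are assumptions chosen for illustration (bilabials /p b m/ sharing one viseme class is a common grouping), not the specific mapping used in the paper; the snippet simply expresses the same utterance as words, characters, phonemes, and visemes, and shows why distinct words can collapse to identical viseme sequences (homophemes).

```python
# Illustrative only: four possible decoding units for the utterance "bat man",
# using an assumed phoneme set and phoneme-to-viseme grouping.
PHONEMES = {"bat": ["b", "ae", "t"], "man": ["m", "ae", "n"]}
VISEME_OF = {"b": "V_bilabial", "m": "V_bilabial", "p": "V_bilabial",
             "ae": "V_open", "t": "V_alveolar", "n": "V_alveolar"}

utterance  = ["bat", "man"]
words      = utterance                                    # word-level units
characters = [c for w in utterance for c in w]            # character-level units
phonemes   = [p for w in utterance for p in PHONEMES[w]]  # phoneme-level units
visemes    = [VISEME_OF[p] for p in phonemes]             # viseme-level units

print(words, characters, phonemes, visemes, sep="\n")
# "bat" and "man" yield the same viseme sequence here, i.e. they are homophemes
# under this grouping, which is the ambiguity the viseme-to-word stage resolves.
```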
