Abstract

In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose relative to the camera. To handle these situations, we introduce a pose normalization block into a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance related to pose changes and pose normalization techniques. In audio-visual experiments we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and pose normalization techniques in the weight assigned to the visual stream in the classifier.
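As a concrete illustration of the core step, the sketch below (ours, not the authors' code) fits a regularized least-squares linear mapping from vectorized non-frontal mouth images to paired frontal ones. The function names, the ridge term reg, and the paired-training-data setup are assumptions made for the example.

```python
# Minimal sketch of pose normalization by linear regression (illustrative,
# not the paper's implementation). X_nonfrontal and X_frontal hold paired,
# vectorized mouth-region images, one row per training sample.
import numpy as np

def fit_pose_mapping(X_nonfrontal, X_frontal, reg=1e-3):
    """Solve the ridge-regularized normal equations for W such that
    X_nonfrontal @ W approximates X_frontal."""
    d = X_nonfrontal.shape[1]
    return np.linalg.solve(
        X_nonfrontal.T @ X_nonfrontal + reg * np.eye(d),
        X_nonfrontal.T @ X_frontal,
    )

def to_virtual_frontal(x_nonfrontal, W):
    """Map a vectorized non-frontal image to a virtual frontal view."""
    return x_nonfrontal @ W
```

The virtual frontal view can then be fed to an otherwise unchanged frontal lipreading pipeline, which is what allows the normalization block to be inserted at different stages of the system.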

Highlights

  • The performance of automatic speech recognition (ASR) systems degrades heavily in the presence of noise, compromising their use in real-world scenarios

  • In pose-invariant lipreading (Section 3), we present the techniques adopted in face recognition to obtain a multi-pose system, justify the choice of linear regression (LR) as the technique best suited to our audio-visual automatic speech recognition (AV-ASR) system, and study the different feature spaces where pose normalization can take place

  • In AV-ASR we are interested in the influence of pose normalization on the final performance and, especially, on the optimal value of the weight assigned to the visual stream

Summary

Introduction

The performance of automatic speech recognition (ASR) systems degrades heavily in the presence of noise, compromising their use in real-world scenarios. Moreover, the weight assigned to the visual stream in the audiovisual classifier should account for the pose normalization. Previous work on this topic is limited to Lucey et al. [13,14], who projected the final visual speech features of complete profile images to a frontal viewpoint with a linear transform. The authors do not justify the use of a linear transform between the visual speech features of different poses, are limited to the extreme cases of completely frontal and profile views, and their audiovisual experiments are not conclusive. Compared to these studies, we introduce other projection techniques applied in face recognition to the lipreading task, and we discuss and justify their use in the different feature spaces involved in the lipreading system: the images themselves, a smooth and compact representation of the images in the frequency domain, or the final features used in the classifier.
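As a sketch of what the projection could look like in the frequency-domain feature space (our assumption for illustration, reusing fit_pose_mapping from the sketch above; the truncation size keep=8 is arbitrary), the same mapping can be learned on low-frequency 2-D DCT coefficients instead of raw pixels:

```python
# Minimal sketch: the linear pose mapping learned in a compact DCT feature
# space instead of on raw pixels (illustrative, not the paper's exact
# configuration).
import numpy as np
from scipy.fft import dctn

def dct_features(image, keep=8):
    """Keep only the low-frequency block of the 2-D DCT: a smooth,
    compact frequency-domain representation of the mouth image."""
    coeffs = dctn(image, norm="ortho")
    return coeffs[:keep, :keep].ravel()

# Hypothetical usage with paired training images (arrays of shape (H, W)):
# F_nf = np.stack([dct_features(im) for im in nonfrontal_images])
# F_fr = np.stack([dct_features(im) for im in frontal_images])
# W = fit_pose_mapping(F_nf, F_fr)   # regression from the earlier sketch
# frontal_feats = dct_features(new_nonfrontal_image) @ W
```

One plausible appeal of the truncated DCT space is that it keeps the regression small and well-conditioned compared with fitting a mapping over all pixels.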

[Figure: comparison of pose normalization approaches (LDA features, LR on images, LR on DCT coefficients); see also the sections "Linear regression in multi-pose face recognition" and "Conclusions"]