A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition

Chao Sui,Mohammed Bennamoun,Roberto Togneri

doi:10.1016/j.specom.2017.01.005

Chao Sui, Mohammed Bennamoun + Show 1 more

Open Access

https://doi.org/10.1016/j.specom.2017.01.005

Copy DOI

Journal: Speech Communication	Publication Date: Mar 23, 2017
Citations: 12	License type: cc-by-nc-nd

Affiliation: University of Western Australia

Abstract

Although stereo information has been extensively used in computer vision tasks recently, the incorporation of stereo visual information in Audio-Visual Speech Recognition (AVSR) systems and whether it can boost the speech accuracy still remains a largely undeveloped area. This paper addresses three fundamental issues in this area: 1) Will the stereo features benefit visual and audio-visual speech recognition? 2) If so, how much information is embedded in stereo features? 3) How to encode both planar and stereo information in a compact feature vector? In this study, we propose a comprehensive study on the characteristics of both planar and stereo visual features, and extensively analyse why the stereo information can boost the visual speech recognition. Based on the different information embedded in planar and stereo features, we present a new Cascade Hybrid Appearance Visual Feature (CHAVF) extraction scheme which successfully combines planar and stereo visual information into a compact feature vector, and evaluate this novel feature on visual and audio-visual connected digit recognition and isolated phrase recognition. The results show that stereo information is capable of significantly boosting the speech recognition, and the performance of our proposed visual feature outperforms the other commonly used appearance-based visual features on both the visual and audio-visual speech recognition tasks. Particularly, our proposed planar-stereo visual feature yields approximately 21% relative improvement over the planar visual feature. To the best of our knowledge, this is the first paper that extensively evaluates the different characteristics of planar and stereo visual features, and we first show that using the stereo feature along with the planar feature can significantly boost the accuracy on a large-scale audio-visual data corpus.

Full Text