Abstract

Various techniques for noise-robust speech recognition have been developed over many years. However, speech recognition in nonstationary noisy environments, such as inside a moving car, remains a very challenging research problem. Audio-visual speech recognition is a promising technique for improving noise robustness under such adverse acoustic conditions. Our research group has recently developed an audio-visual speech database, "AURORA-2J-AV", following the protocol of the AURORA2 database, which has been widely used for evaluating the performance of audio-only speech recognition techniques. The AURORA-2J-AV database consists of audio speech signals and two types of image sequences, color and infrared human face images, recorded while multi-digit numbers are spoken by about 100 native Japanese speakers. Visual noise as well as acoustic noise is added to the database after recording to simulate various conditions. Audio-visual speech recognition experiments are conducted using the audio speech and infrared images of the AURORA-2J-AV database. Two visual features, correlation coefficients (CCs) between two successive image frames and principal component scores (PCSs) of lip images, are compared under various signal-to-noise ratio (SNR) conditions. Experimental results show that CCs outperform PCSs at high SNRs, while PCSs perform better at lower SNRs.
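To make the two visual features concrete, the following is a minimal Python sketch of how CCs between successive frames and PCSs of vectorized lip images might be computed. It assumes grayscale lip-region frames in a (T, H, W) array; the frame size, number of components, and exact preprocessing are illustrative assumptions, not the paper's settings.

import numpy as np

def frame_correlations(frames):
    """CC feature: Pearson correlation between successive frames.

    `frames`: array of shape (T, H, W) of grayscale lip images
    (an assumed input format, not the paper's exact pipeline).
    Returns an array of T-1 correlation coefficients.
    """
    flat = frames.reshape(len(frames), -1).astype(float)
    ccs = [np.corrcoef(a, b)[0, 1] for a, b in zip(flat[:-1], flat[1:])]
    return np.array(ccs)

def pcs_features(frames, n_components=10):
    """PCS feature: principal component scores of vectorized lip images.

    Plain PCA via SVD on mean-centered frames; n_components=10 is an
    illustrative choice. Returns scores of shape (T, n_components).
    """
    flat = frames.reshape(len(frames), -1).astype(float)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal axes (right singular vectors)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Example with synthetic 32x32 "lip" frames
frames = np.random.rand(20, 32, 32)
print(frame_correlations(frames).shape)  # (19,)
print(pcs_features(frames, 5).shape)     # (20, 5)

Intuitively, the CC feature captures only frame-to-frame change (mouth motion), which is compact but discards static shape, whereas the PCSs retain a low-dimensional description of lip appearance itself; this difference is one plausible reason the two features behave differently across SNR conditions.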
