Abstract

Audio-visual speech recognition (AVSR) has been shown to improve Automatic Speech Recognition (ASR) accuracy in noisy environments. In real-world scenarios, however, a speaker does not always face the camera; a visual speech recognition (VSR) system should therefore correctly recognize spoken content not only from frontal but also from non-frontal faces. In this paper, we introduce our efforts to build a new multi-angle audio-visual speech corpus, GAMVA. GAMVA consists of multi-angle lip images captured by 12 cameras, face feature points, and speech audio. So far, GAMVA contains 20 Japanese male subjects, each of whom uttered 25 types of Japanese daily-use greeting phrases. Baseline ASR, VSR, and AVSR systems based on our previous work were built to evaluate the GAMVA data. Experimental results show that these systems can recognize lip images at various angles and improve speech recognition accuracy in noisy environments.
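The corpus composition stated above (20 subjects, 25 utterances each, 12 camera angles) can be summarized in a short sketch. The constant names below are illustrative only, not an official GAMVA API, and the figures come directly from the abstract.

```python
# Hypothetical sketch of GAMVA's composition as stated in the abstract.
# Constant names are illustrative assumptions, not part of any released toolkit.
NUM_SUBJECTS = 20    # Japanese male speakers recorded so far
NUM_UTTERANCES = 25  # Japanese daily-use greeting phrases per speaker
NUM_CAMERAS = 12     # simultaneous viewing angles per utterance

# Each recording pairs multi-angle lip video with face feature points and audio,
# so the total number of lip-image sequences is the product of the three counts.
total_clips = NUM_SUBJECTS * NUM_UTTERANCES * NUM_CAMERAS
print(total_clips)  # 6000
```

This back-of-the-envelope count shows why a single recording session per speaker still yields a sizable multi-angle dataset: every utterance contributes 12 synchronized views.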
