Abstract

It is well known that frontal video of the speaker’s mouth region contains significant speech information that, when combined with the acoustic signal, can improve the accuracy and noise robustness of automatic speech recognition (ASR) systems. However, extracting such visual speech information from full-face videos is computationally expensive, as it requires tracking faces and facial features. In addition, robust face detection remains challenging in practical human–computer interaction (HCI), where the subject’s posture and environment (lighting, background) are difficult to control and, consequently, to compensate for. In this paper, in order to bypass these hindrances to practical bimodal ASR, we consider the use of a specially designed, wearable audio-visual headset, a feasible solution in certain HCI scenarios. Such a headset can consistently focus on the speaker’s mouth region, thus eliminating the need for face tracking altogether. In addition, it employs infrared illumination to provide robustness against severe lighting variations. We study the appropriateness of this novel device for audio-visual ASR by conducting both small- and large-vocabulary recognition experiments on data recorded with it under various lighting conditions. We benchmark the resulting ASR performance against bimodal data containing frontal, full-face videos collected in an ideal, studio-like environment under uniform lighting. The experiments demonstrate that the infrared headset video contains speech information comparable to that of the studio, full-face video data, and is therefore a viable sensory device for audio-visual ASR.
