Abstract

Visual information from a speaker's mouth region is known to improve the robustness of automatic speech recognition. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face. In contrast, this paper investigates extracting visual speech information from the speaker's profile view and, to our knowledge, constitutes the first serious attempt to address this problem. As with any AVASR system, the overall recognition performance depends heavily on the visual front end. This is especially true for profile-view data, where the facial features appear heavily compacted compared to the frontal view. In this paper, we describe our visual front-end approach in particular, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video. Our experiments show that AVASR is possible from profile views, with moderate performance degradation compared to frontal video data.
