Abstract

Visual speech cues are known to improve the performance of automatic speech recognition (ASR). However, most prior work has used mainly the speaker's frontal pose. We therefore introduce a new database for large-vocabulary audio-visual automatic speech recognition (AV-ASR), which contains not only frontal face images but also face images captured from different angles (multi-view face images). Another contribution of this paper is a new algorithm that can model audio and visual characteristics across phones. Finally, we conducted large-vocabulary continuous speech recognition experiments on the new database using the new algorithm. Experimental results show that the proposed AV-ASR system achieves high accuracy even when the views in the training and test data do not match.
