Abstract

Several past studies have shown that features based on the kinematics of the speech articulators improve phonetic recognition accuracy when combined with acoustic features. It is also known that audio-visual speech recognition outperforms audio-only recognition, which indicates that the information from the visible articulators is complementary to that provided by the acoustic features. The visible articulators can typically be extracted directly from a facial video. The speech articulators, on the other hand, are recorded using electromagnetic articulography (EMA), which requires highly specialized equipment. The latter are therefore not directly available in practice and are usually estimated from speech via acoustic-to-articulatory inversion. In this work, we compare the information that the visible and the estimated articulators provide about different phonetic classes, with and without acoustic features. The information provided by the visible, articulatory, acoustic, and combined features is quantified by mutual information (MI). For this study, we have created a large phonetically rich audio-visual (PRAV) dataset comprising 9000 TIMIT sentences spoken by four subjects. Experiments on the PRAV corpus reveal that the articulatory features estimated by inversion are more informative than the visible features but less informative than the acoustic features. This suggests that the recognition advantage of visible articulatory features could be achieved by recovering them from the acoustic signal itself.
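To make the MI quantification concrete, the sketch below shows a plug-in (histogram) estimate of I(X; Y) in bits between a discrete feature symbol and a phone-class label, computed from paired samples. This is only an illustration of the general technique, not the paper's exact estimator: continuous acoustic or articulatory features would first have to be discretized (e.g., by vector quantization) or handled with a continuous-variable MI estimator, and the function and variable names here are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples.

    xs: sequence of discrete feature symbols (e.g., VQ codebook indices)
    ys: sequence of phone-class labels, aligned with xs
    """
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts folded in
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

With perfectly predictive binary features the estimate is 1 bit, and with independent features it is 0, matching the intuition that a more informative feature stream yields higher MI with the phonetic classes.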
