Abstract

The use of visual information from the speaker's mouth region has been shown to improve the performance of automatic speech recognition (ASR) systems. This is particularly important in the presence of noise, which even in moderate form severely degrades the recognition performance of systems using only audio information. Various sets of features extracted from the speaker's mouth region have been used to improve ASR performance in such challenging conditions, with considerable success. To the best of the authors' knowledge, however, the effect of these techniques on recognition performance at the phoneme level has not yet been investigated. This paper presents a comparison of phoneme recognition performance using visual features extracted from the mouth region-of-interest with the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). New DCT and DWT features are also extracted and compared with previously used ones. These features were combined with audio features based on Mel frequency cepstral coefficients (MFCC). This work will help in selecting suitable features for different applications and in identifying the limitations of these methods in the recognition of individual phonemes.
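The DCT-based visual front-end described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the ROI size (32×32) and the number of retained coefficients (a 6×6 low-frequency block, 36 features) are assumptions chosen for the example; the common practice it reflects is taking the 2D DCT of the grayscale mouth crop and keeping only the low-frequency coefficients, which carry most of the mouth-shape energy.

```python
import numpy as np
from scipy.fft import dctn

def dct_mouth_features(roi, n=6):
    """Return low-frequency 2D-DCT coefficients of a grayscale mouth ROI.

    roi : 2-D array, the cropped mouth region (grayscale intensities)
    n   : side of the top-left coefficient block to keep (n*n features,
          an illustrative choice, not the paper's setting)
    """
    coeffs = dctn(roi.astype(float), norm="ortho")  # 2-D type-II DCT
    # Low-frequency (top-left) coefficients summarize coarse mouth shape.
    return coeffs[:n, :n].ravel()

# Toy example with a synthetic 32x32 patch standing in for a mouth crop.
rng = np.random.default_rng(0)
roi = rng.random((32, 32))
feats = dct_mouth_features(roi)
print(feats.shape)  # (36,)
```

In an audio-visual ASR pipeline, a vector like `feats` would be computed per video frame, interpolated to the audio frame rate, and concatenated with the MFCC features before decoding. A DWT front-end is analogous, with wavelet subband coefficients replacing the DCT block.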
