Abstract
Visual information from a speaker's mouth region has been shown to improve the performance of automatic speech recognition (ASR) systems. This is particularly important in the presence of noise, which even in moderate form severely degrades the recognition performance of systems that use only audio information. Various sets of features extracted from the speaker's mouth region have been used successfully to improve ASR performance in such challenging conditions. To the best of the authors' knowledge, the effect of these techniques on recognition performance at the level of individual phonemes has not yet been investigated. This paper presents a comparison of phoneme recognition performance using visual features extracted from the mouth region of interest with the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). New DCT and DWT features have also been extracted and compared with the previously used ones. These features were used together with audio features based on Mel-frequency cepstral coefficients (MFCC). This work will help in selecting suitable features for different applications and in identifying the limitations of these methods in recognizing individual phonemes.
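As an illustration of the kind of DCT-based visual feature discussed above, the following is a minimal sketch and not the authors' implementation. It assumes a small grayscale mouth region of interest given as a list of rows, keeps the top-left block of 2-D DCT coefficients as the visual feature vector, and concatenates it with an audio vector. The function names, the block size, the concatenation-based fusion, and the placeholder MFCC values are all assumptions made for this example; the abstract does not specify how coefficients are selected or how the audio and visual streams are combined.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(roi):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in roi]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # Transpose back so the result is indexed as coeffs[row][col].
    return [list(r) for r in zip(*cols)]


def low_frequency_features(roi, size=3):
    """Keep the top-left size x size coefficients (assumed selection rule)."""
    coeffs = dct_2d(roi)
    return [coeffs[u][v] for u in range(size) for v in range(size)]


def fuse(audio_vec, visual_vec):
    """Feature-level fusion by simple concatenation (assumed scheme)."""
    return list(audio_vec) + list(visual_vec)


# Hypothetical 4x4 mouth ROI with a horizontal intensity gradient.
roi = [[float(col) for col in range(4)] for _ in range(4)]
visual = low_frequency_features(roi, size=3)

# Placeholder values standing in for one frame of MFCC features.
mfcc_frame = [0.0] * 13
audio_visual = fuse(mfcc_frame, visual)
```

A DWT-based variant would replace `dct_2d` with a 2-D wavelet decomposition and keep the approximation sub-band instead of the low-frequency DCT block; the fusion step would stay the same.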