Abstract

Audio-visual speech recognition by machine has become increasingly important as audio-only automatic speech recognition approaches its performance ceiling. Audio alone gives good performance, but adding visual information yields a more robust recognition system when the audio signal is degraded by noise or varies with the environmental channel. This paper proposes an audio-visual automatic speech recognition (AV-ASR) system based on machine learning approaches. Visual information is captured from the lip contour: pseudo-Zernike moments (PZMs) are extracted as visual features, and 19th-order Mel-frequency cepstral coefficients (MFCCs) serve as audio features. Two machine learning approaches, artificial neural networks (ANN) and support vector machines (SVM), are used to recognise speech in the audio and visual modalities. After the individual recognition by the two systems, a combined decision is taken. The paper also evaluates the individual performance of the audio-only and visual-only speech recognition systems.
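As a rough illustration of the pipeline summarised above, the sketch below extracts 19th-order MFCC audio features, trains one SVM per modality, and fuses the two decisions. The abstract does not specify the tooling or the fusion rule, so librosa, scikit-learn's SVC, and simple averaging of per-modality class probabilities are all assumptions here; the PZM visual features are treated as a precomputed array.

```python
# Hedged sketch of the audio branch plus late decision fusion.
# Assumptions (not stated in the paper): librosa for MFCCs, sklearn SVC
# classifiers, and probability averaging as the combined decision.
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_features(wav_path, n_mfcc=19):
    """Extract 19th-order MFCCs and average them over time into one vector."""
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (19, frames)
    return mfcc.mean(axis=1)                                     # (19,)

def train_and_fuse(X_audio, X_visual, y, Xa_test, Xv_test):
    """X_audio: (n, 19) MFCC vectors; X_visual: (n, d) precomputed PZM
    vectors; y: integer class labels. Returns fused predictions."""
    audio_clf = SVC(kernel="rbf", probability=True).fit(X_audio, y)
    visual_clf = SVC(kernel="rbf", probability=True).fit(X_visual, y)
    # Late fusion: average the two classifiers' posterior estimates.
    p = (audio_clf.predict_proba(Xa_test) + visual_clf.predict_proba(Xv_test)) / 2
    return audio_clf.classes_[np.argmax(p, axis=1)]
```

The same structure would apply with an ANN in place of either SVC; the key point is that each modality is recognised independently and the decisions are combined afterwards.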
