Abstract

Machine-based audio-visual speech recognition plays an important role as research in automatic speech recognition approaches its performance ceiling. Audio alone gives good performance, but adding visual information yields a more robust recognition system when the audio signal is degraded by a noisy environment or varies because of the transmission channel. This paper proposes an audio-visual automatic speech recognition (AV-ASR) system based on machine learning approaches. Visual information is captured from the lip contour: pseudo-Zernike moments (PZMs) are extracted as visual features, and 19th-order Mel-frequency cepstral coefficients (MFCCs) as audio features. Two machine learning approaches, artificial neural networks (ANN) and support vector machines (SVM), are used to recognise speech from the audio and visual modalities. After the two systems produce their individual recognition results, a combined decision is taken. The paper also evaluates the individual performance of the audio-only and visual-only recognisers built with these machine learning approaches.
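
As a rough illustration of the pipeline described above, the sketch below computes pseudo-Zernike moment magnitudes from their standard radial-polynomial definition, trains an SVM on audio features and an ANN (scikit-learn's MLPClassifier as a stand-in) on visual features, and fuses the two at decision level by a weighted sum of class posteriors. Everything beyond the techniques named in the abstract is an assumption made for the demo: the data is synthetic, the PZM order, network size, and fusion weight w are placeholders, and in practice the 19 MFCCs would come from a tool such as librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=19), pooled over frames.

import numpy as np
from math import factorial
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


def pseudo_zernike_moment(img, n, l):
    """|A_nl| of a grayscale patch, from the standard PZM definition (0 <= l <= n)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (2 * xs - (w - 1)) / (w - 1)       # map pixel grid onto the unit disk
    y = (2 * ys - (h - 1)) / (h - 1)
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = r <= 1.0                      # PZMs are defined on the unit disk only
    R = np.zeros_like(r)                   # radial polynomial R_nl(r)
    for s in range(n - l + 1):
        c = ((-1) ** s * factorial(2 * n + 1 - s)
             / (factorial(s) * factorial(n - l - s) * factorial(n + l + 1 - s)))
        R += c * r ** (n - s)
    V = R * np.exp(-1j * l * theta)        # basis function V_nl(r, theta)
    return abs((n + 1) / np.pi * np.sum(img[inside] * V[inside]))


def visual_features(lip_patch, max_order=4):
    """Stack PZM magnitudes up to max_order into one visual feature vector."""
    return np.array([pseudo_zernike_moment(lip_patch, n, l)
                     for n in range(max_order + 1) for l in range(n + 1)])


rng = np.random.default_rng(0)
n_utt, n_classes = 60, 3
labels = rng.integers(0, n_classes, n_utt)

# Synthetic stand-ins: in practice X_audio holds pooled 19th-order MFCCs and
# X_visual holds PZMs of the segmented lip-contour region.
X_audio = rng.normal(size=(n_utt, 19)) + labels[:, None]
X_visual = np.stack([visual_features(rng.random((32, 32)) + 0.1 * lab)
                     for lab in labels])

svm = SVC(probability=True).fit(X_audio, labels)            # audio classifier
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=3000,
                    random_state=0).fit(X_visual, labels)   # visual classifier

# Late (decision-level) fusion: weighted sum of the two posterior estimates.
w = 0.7                                    # audio weight; would be tuned on held-out data
posteriors = w * svm.predict_proba(X_audio) + (1 - w) * ann.predict_proba(X_visual)
fused_pred = svm.classes_[posteriors.argmax(axis=1)]
print("fused accuracy on training data:", (fused_pred == labels).mean())

Weighted posterior fusion is only one way to take the combined decision; the abstract does not specify the paper's exact combination rule, so the rule above should be read as one plausible choice rather than the authors' method.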
