Abstract

Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to developments in deep learning. Most VSR research focuses only on frontal face images. In real scenes, however, a VSR system must correctly recognize spoken content not only from frontal faces but also from diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces captured at any angle. First, view classification is carried out to estimate the face angle. Based on the result, feature extraction is conducted using the best combination of pre-trained feature extraction models, and lipreading is then carried out on the extracted features. We also developed audio-visual speech recognition (AVSR) that combines the VSR with conventional ASR: audio results obtained from ASR are merged with the visual results by decision fusion. We evaluated our methods on OuluVS2, a multi-angle audio-visual database, and confirmed that our approach achieves the best performance among conventional VSR schemes on a phrase classification task. In addition, our AVSR results are better than both the ASR and VSR results.
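As a rough illustration of this pipeline, the sketch below strings the three stages together. Here `classify_view`, `extract_features`, and `classify_phrase` are hypothetical stand-ins for the paper's trained networks, with random projections in place of real models; only the overall flow reflects the method described above.

```python
import numpy as np

ANGLES = [0, 30, 45, 60, 90]          # the five OuluVS2 camera views (degrees)
rng = np.random.default_rng(0)

def classify_view(frames: np.ndarray) -> int:
    """Step 1: estimate the face angle of the clip.
    Stand-in: the real system uses a trained view classifier."""
    return int(np.argmax(rng.random(len(ANGLES))))

def extract_features(frames: np.ndarray, view: int) -> np.ndarray:
    """Step 2: extract visual features; in the real system, `view`
    selects the best combination of pre-trained extractors.
    Stand-in: per-frame pooling plus a random projection."""
    pooled = frames.mean(axis=(1, 2))            # one value per frame
    return pooled @ rng.standard_normal((pooled.size, 256))

def classify_phrase(feats: np.ndarray, n_phrases: int = 10) -> int:
    """Step 3: lipreading, i.e. classify one of the 10 OuluVS2 phrases.
    Stand-in for the trained recognition network."""
    return int(np.argmax(feats @ rng.standard_normal((feats.size, n_phrases))))

clip = rng.random((29, 44, 50))                  # T x H x W mouth-region frames
view = classify_view(clip)
phrase = classify_phrase(extract_features(clip, view))
print(f"estimated view: {ANGLES[view]} deg, phrase id: {phrase}")
```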

Highlights

  • Automatic speech recognition (ASR) achieves high recognition performance by using deep learning (DL), an attractive artificial intelligence technology, and is used in various scenarios, such as voice input for mobile phones and car navigation systems

  • We focus on the first approach and propose a feature integration-based multi-angle visual speech recognition (VSR) system using DL, specifically 3D convolutional neural networks (CNNs), one kind of deep neural network (DNN); a minimal sketch of such an extractor follows this list

  • We proposed a multi-angle VSR system in which feature extraction was conducted using angle-specific models based on view classification results, followed by feature integration and VSR
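To make the 3D CNN mentioned above concrete, here is a minimal PyTorch sketch of a clip-level feature extractor. The layer sizes, kernel shapes, and input resolution are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class Lip3DCNN(nn.Module):
    """Illustrative 3D CNN that maps a mouth-region clip to one feature vector."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            # 3D convolutions aggregate information over time as well as
            # space, capturing lip motion across consecutive frames.
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # pool to one vector per clip
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, T, H, W) grayscale mouth-region frames
        return self.net(clip)

features = Lip3DCNN()(torch.randn(2, 1, 29, 44, 50))
print(features.shape)  # torch.Size([2, 256])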

Introduction

Automatic speech recognition (ASR) achieves high recognition performance by using deep learning (DL), an attractive artificial intelligence technology, and is used in various scenarios, such as voice input for mobile phones and car navigation systems. In real environments, however, speech waveforms are degraded by audio noise, which reduces recognition accuracy. To overcome this issue, we need ASR systems that are robust against audio noise. One approach applicable in noisy environments is audio-visual speech recognition (AVSR, also known as multi-modal speech recognition), which combines ASR frameworks with visual speech recognition (VSR, also known as lipreading). Recently, owing to state-of-the-art DL technology, VSR has achieved high performance.
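To illustrate how the two streams can be combined, here is a minimal decision-fusion sketch assuming both recognizers output per-phrase posteriors. The log-linear weighting and the weight `lam` are common illustrative choices, not necessarily the paper's exact fusion rule.

```python
import numpy as np

def fuse(p_audio: np.ndarray, p_visual: np.ndarray, lam: float = 0.7) -> int:
    """Combine ASR and VSR posteriors and return the winning phrase id.
    `lam` trades off the audio stream (reliable when audio is clean)
    against the visual stream (unaffected by audio noise)."""
    score = lam * np.log(p_audio) + (1.0 - lam) * np.log(p_visual)
    return int(np.argmax(score))

p_a = np.array([0.6, 0.3, 0.1])   # ASR posteriors (may be degraded by noise)
p_v = np.array([0.2, 0.7, 0.1])   # VSR posteriors (noise-independent)
print(fuse(p_a, p_v))             # fused decision over the two streams
```

In noisy conditions a smaller `lam` shifts weight toward the visual stream, which is one intuition behind why AVSR can outperform either modality alone.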
