Abstract

To advance the study of lip reading in accordance with Chinese pronunciation norms, we investigated Mandarin tone recognition based on visual information, in contrast to previous character-based Chinese lip-reading techniques. In this paper, we mainly studied the vowel tonal transformations in Chinese pronunciation and designed a lightweight skipping convolution network framework (SCNet). The experimental results showed that the SCNet was more sensitive to detailed descriptions of pitch change than traditional models and achieved better tone recognition and outstanding anti-interference performance. In addition, we conducted a more detailed study of the role of deep texture information in lip-reading recognition. We found that deep texture information has a significant effect on tone recognition, confirming the feasibility of multimodal lip reading for Chinese tone recognition. Similarly, we verified the performance of the SCNet on syllable tone recognition and found that the vowel and syllable tone recognition accuracy of our model reached 97.3%, demonstrating the robustness of the proposed method and its broad applicability to Chinese tone recognition.

Highlights

  • The superior performance of lip reading in robust speech recognition has received widespread attention. The goal of lip reading is to improve the robustness of speech recognition in special situations such as low signal-to-noise ratio (SNR) or silent environments

  • We focus on the study of the vowel tonal changes in Chinese pronunciation

  • (1) For Chinese pronunciation tonal changes, we propose a new lightweight network framework, the skipping convolution network framework (SCNet), which is more sensitive to the transformation of details compared with the traditional network architecture
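The paper does not publish SCNet's implementation, but the core idea named in the highlight (a convolution path combined with a skipped, unmodified path so that fine detail changes are preserved) can be illustrated with a minimal pure-Python sketch. The function names `conv1d_same` and `skip_conv_block` are hypothetical, not from the paper, and a 1-D signal stands in for the real video features.

```python
def conv1d_same(x, kernel):
    """1-D correlation with zero padding so the output length matches the input."""
    pad = len(kernel) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(xp[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(x))]

def skip_conv_block(x, kernel):
    """Convolve the input, then add the unmodified input back via the skip path,
    so small local variations in x survive the smoothing of the convolution."""
    y = conv1d_same(x, kernel)
    return [a + b for a, b in zip(y, x)]

signal = [0.0, 1.0, 2.0, 1.0, 0.0]          # toy stand-in for a pitch-related feature track
out = skip_conv_block(signal, [0.25, 0.5, 0.25])
# → [0.25, 2.0, 3.5, 2.0, 0.25]
```

The skip path is what makes such a block sensitive to detail: the convolution alone smooths the input, while adding the raw input back keeps its local transitions in the output.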


Summary

Introduction

The superior performance of lip reading in robust speech recognition has received widespread attention. The goal of lip reading is to improve the robustness of speech recognition in special situations such as low signal-to-noise ratio (SNR) or silent environments. Pixel-based methods extract visual features from the image directly or after some preprocessing and transformation. Model-based methods use low-dimensional features to express image features, and these features are typically invariant to factors such as translation, rotation, scaling, or illumination. Both methods extract relevant information directly from the region of interest (ROI) in the planar image [8]. Wang et al. [13] used 3D lip points obtained from Kinect, improving the performance of multimodal speech recognition. Studies by these pioneers have demonstrated the effectiveness of depth information in lip-reading recognition. However, currently proposed lip-reading recognition based on 3D depth information does not consider the inherent texture that drives lip motion during natural speech.
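Both pixel-based and model-based methods start by cropping the lip ROI from each frame. The paper does not give its cropping procedure; the sketch below is a generic illustration, assuming a hypothetical landmark detector has already returned (x, y) lip contour points for a grayscale frame.

```python
def lip_roi(frame, landmarks, margin=2):
    """Crop a rectangular region of interest around lip landmarks.

    frame     : 2-D grayscale image as a list of pixel rows
    landmarks : (x, y) lip contour points from a hypothetical detector
    margin    : extra pixels kept around the landmark bounding box
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    # Bounding box of the landmarks, expanded by the margin and clipped
    # to the frame borders.
    x0 = max(min(xs) - margin, 0)
    y0 = max(min(ys) - margin, 0)
    x1 = min(max(xs) + margin + 1, len(frame[0]))
    y1 = min(max(ys) + margin + 1, len(frame))
    return [row[x0:x1] for row in frame[y0:y1]]
```

A pixel-based method would feed this cropped patch (raw or lightly transformed) to the recognizer, while a model-based method would instead fit a low-dimensional shape or appearance model to it.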

Data Collection and Feature Preprocessing
Feature Preprocessing
Network Architecture
Experiments and Results
Method
