Abstract
To advance the study of lip-reading recognition in accordance with Chinese pronunciation norms, we investigated Mandarin tone recognition based on visual information, in contrast to previous character-based Chinese lip-reading techniques. In this paper, we mainly studied the tonal transformation of vowels in Chinese pronunciation and designed a lightweight skipping convolution network framework (SCNet). The experimental results showed that the SCNet was more sensitive to fine-grained pitch changes than traditional models and achieved better tone recognition accuracy and outstanding anti-interference performance. In addition, we conducted a more detailed study of how deep texture information assists lip-reading recognition. We found that deep texture information has a significant effect on tone recognition, confirming the feasibility of multimodal lip reading for Chinese tone recognition. Finally, we verified the SCNet on syllable tone recognition and found that the vowel and syllable tone recognition accuracy of our model was as high as 97.3%, which demonstrates the robustness of our proposed method and its broad applicability to Chinese tone recognition.
Highlights
The superior performance of lip reading in robust speech recognition has received widespread attention. The goal of lip reading is to improve the robustness of speech recognition in special situations such as low signal-to-noise ratio (SNR) or silent environments.
We focus on the study of the vowel tonal changes in Chinese pronunciation
(1) For Chinese pronunciation tonal changes, we propose a new lightweight network framework, the skipping convolution network framework (SCNet), which is more sensitive to fine-grained changes than traditional network architectures
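The summary does not specify SCNet's internal layout, but the stated idea of a "skipping" convolution that stays sensitive to fine detail resembles a skip (identity) connection around a convolutional layer. The following is a minimal illustrative sketch of that general mechanism in numpy, not the authors' actual architecture; `conv2d`, `skip_conv_block`, and the averaging kernel are all hypothetical names chosen for this example.

```python
import numpy as np

def conv2d(x, kernel):
    """'Same'-padded single-channel 2D convolution via direct summation."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def skip_conv_block(x, kernel):
    """Convolve, apply ReLU, then add the input back through a skip path,
    so fine-grained detail in x is preserved alongside the filtered output."""
    y = np.maximum(conv2d(x, kernel), 0.0)  # ReLU on the conv response
    return y + x                            # identity skip connection

x = np.random.default_rng(0).random((8, 8))
k = np.ones((3, 3)) / 9.0                   # simple averaging kernel
out = skip_conv_block(x, k)
print(out.shape)  # (8, 8)
```

Because the skip path carries the input through unchanged, small local variations (such as subtle lip-shape differences between tones) are not smoothed away by the convolution alone, which is the intuition behind detail-sensitive residual designs.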
Summary
The superior performance of lip reading in robust speech recognition has received widespread attention. The goal of lip reading is to improve the robustness of speech recognition in special situations such as low signal-to-noise ratio (SNR) or silent environments. Pixel-based methods extract visual features from the image directly or after some preprocessing and transformation. Model-based methods use low-dimensional features to express image content, and these features are typically invariant to factors such as translation, rotation, scaling, or illumination. Both kinds of methods extract relevant information directly from the region of interest (ROI) in the planar image [8]. Wang et al. [13] used 3D lip points obtained from Kinect, improving the performance of multimodal speech recognition. Studies by these pioneers have demonstrated the effectiveness of depth information in lip-reading recognition. However, the lip-reading recognition methods currently proposed based on 3D depth information do not consider the inherent texture that drives lip motion during natural speech.
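To make the pixel-based pipeline above concrete, here is a minimal sketch of extracting a fixed-size feature vector from a lip ROI in a grayscale frame. The function name, ROI format, and the nearest-neighbour resize are assumptions for illustration; real systems would use a face/lip detector and a proper resampling routine.

```python
import numpy as np

def extract_roi_features(frame, roi, size=(32, 32)):
    """Crop the lip region of interest and flatten normalized pixels
    into a feature vector (a basic pixel-based representation)."""
    top, left, h, w = roi
    patch = frame[top:top + h, left:left + w].astype(float)
    # crude nearest-neighbour resize to a fixed grid
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    patch = patch[rows][:, cols]
    # zero-mean, unit-variance normalization to reduce illumination effects
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)
    return patch.ravel()

frame = np.random.default_rng(1).random((120, 160))   # stand-in video frame
feat = extract_roi_features(frame, roi=(60, 50, 40, 60))
print(feat.shape)  # (1024,)
```

The per-patch normalization step gives the crude illumination invariance that model-based methods obtain more systematically; this is why the summary contrasts raw pixel features with model-based, transformation-invariant ones.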