Abstract
Visual information from tongue and lip movements performs strongly in silent speech recognition (SSR) and is attracting growing attention from researchers. Prior work on SSR with tongue-lip visual information has focused mainly on data augmentation and feature fusion. In contrast, we propose a cascading fusion algorithm that fuses the tongue and lip visual information frame by frame, achieving “visual co-occurrence” of the two articulators. In addition, a pre-training and fine-tuning framework based on Visual-HuBERT with masked prediction is used to mitigate the model overfitting caused by the scarcity of training data. The proposed model is evaluated on the TaL corpus and a self-built Chinese corpus. Experimental results show that the proposed model achieves an average word error rate (WER) of 23.34%, a 1.75% reduction relative to the baseline model. On the self-built Chinese corpus, the proposed model achieves an average WER of 50.23% and an average tone error rate of 47.04%. These results show that our proposed model not only outperforms the baseline model but also handles tonal languages in SSR, such as Mandarin.
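The abstract does not specify how the cascading fusion operates; the sketch below shows one plausible frame-wise fusion of frame-aligned tongue and lip features in PyTorch, purely as an illustration of the "visual co-occurrence" idea. All class, parameter, and tensor names are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of frame-wise cascaded fusion of tongue and lip features.
# Names and dimensions are illustrative assumptions, not the paper's API.
import torch
import torch.nn as nn


class CascadedFusion(nn.Module):
    """Fuse per-frame tongue and lip embeddings in two cascaded stages:
    concatenate the frame-aligned features, then refine with a temporal conv."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stage 1: project the concatenated per-frame features back to `dim`.
        self.proj = nn.Linear(2 * dim, dim)
        # Stage 2: mix local temporal context across neighboring frames.
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, tongue: torch.Tensor, lip: torch.Tensor) -> torch.Tensor:
        # tongue, lip: (batch, frames, dim), assumed frame-aligned streams
        fused = self.proj(torch.cat([tongue, lip], dim=-1))  # (B, T, dim)
        # Conv1d expects (B, dim, T), so transpose around the temporal stage.
        fused = self.temporal(fused.transpose(1, 2)).transpose(1, 2)
        return fused  # (B, T, dim) fused sequence, one vector per frame


# Usage with dummy inputs, e.g. ultrasound tongue and lip-video features:
tongue = torch.randn(2, 100, 256)
lip = torch.randn(2, 100, 256)
out = CascadedFusion()(tongue, lip)  # shape: (2, 100, 256)
```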