Abstract

Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained sufficiently on it alone. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode motor information from all facial speech-function areas, in contrast to traditional audio-visual fusion frameworks that rely solely on lip movement. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech. The knowledge gained by the pre-trained model is then applied to mitigate overfitting. Experiments on the UASpeech corpus are designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method is reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, our method achieves a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can thus effectively further reduce the WER of dysarthric speech recognition.
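The two-stage design described above can be summarized in code. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: all module names, dimensions, and the concatenation-based fusion are illustrative assumptions, loosely modeled on AV-HuBERT-style audio-visual fusion. `FacialMotionEncoder` stands in for the stage-one CNN over full-face video, and `AVFusionModel` fuses per-frame audio and visual features with a Transformer encoder, standing in for the stage-two pre-trained recognizer.

```python
# Hypothetical sketch of the two-stage audio-visual fusion idea.
# Names, sizes, and fusion strategy are illustrative assumptions,
# not the authors' MAV-HuBERT implementation.
import torch
import torch.nn as nn


class FacialMotionEncoder(nn.Module):
    """Stage 1 (assumed): a small 3D CNN over full-face video clips, so
    motor information from all facial speech-function areas is encoded,
    not only the lip region."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool out space
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, time, height, width) grayscale face crops
        feats = self.conv(video)                  # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)     # (B, 64, T)
        return self.proj(feats.transpose(1, 2))  # (B, T, embed_dim)


class AVFusionModel(nn.Module):
    """Stage 2 (assumed): concatenate per-frame audio and visual features
    and feed them to a Transformer encoder; the linear head stands in
    for the downstream recognition objective."""

    def __init__(self, audio_dim: int = 80, embed_dim: int = 256,
                 vocab_size: int = 1000):
        super().__init__()
        self.visual = FacialMotionEncoder(embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * embed_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(2 * embed_dim, vocab_size)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, audio_dim) e.g. log-Mel frames aligned with video frames
        a = self.audio_proj(audio)              # (B, T, embed_dim)
        v = self.visual(video)                  # (B, T, embed_dim)
        fused = torch.cat([a, v], dim=-1)       # (B, T, 2 * embed_dim)
        return self.head(self.encoder(fused))  # (B, T, vocab_size) logits


if __name__ == "__main__":
    model = AVFusionModel()
    audio = torch.randn(2, 16, 80)              # 16 synchronized frames
    video = torch.randn(2, 1, 16, 88, 88)
    print(model(audio, video).shape)            # torch.Size([2, 16, 1000])
```

In the actual framework, the fused encoder would be initialized from AV-HuBERT pre-training rather than trained from scratch, which is how the abstract's overfitting mitigation is achieved; the sketch omits that step for brevity.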
