Abstract

Automatic fluency assessment of spontaneous speech without a reference text is a challenging task that depends heavily on the accuracy of automatic speech recognition (ASR). In this scenario, it is worth exploring an assessment method that incorporates ASR: beyond the acoustic features that are essential for assessment, the text output by ASR may also carry fluency-related information. However, most existing studies on automatic fluency assessment of spontaneous speech rely solely on audio features, without exploiting textual information, which may limit the understanding of fluency. To address this, we propose a multimodal automatic speech fluency assessment method that incorporates ASR output. Specifically, we first examine the relevance of the fluency assessment task to the ASR task and fine-tune the Wav2Vec2.0 model with multi-task learning to jointly optimize the ASR and fluency assessment tasks, producing both the fluency assessment results and the ASR output. The text features and audio features obtained from the fine-tuned model are then fed into a multimodal fluency assessment model, which uses attention mechanisms to obtain more reliable assessment results. Finally, experiments on the PSCPSF and Speechocean762 datasets suggest that our proposed method performs well across different assessment scenarios.
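The abstract does not give implementation details, so the following is a minimal sketch of the two stages it describes, assuming a PyTorch/Hugging Face setup. The class names (MultiTaskWav2Vec2, AttentionFusion), the CTC-plus-MSE loss pairing, the loss weight lambda_fluency, and the single cross-attention fusion layer are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the pipeline described in the abstract:
# (1) multi-task fine-tuning of Wav2Vec2.0 for ASR + fluency,
# (2) attention-based fusion of text and audio features.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class MultiTaskWav2Vec2(nn.Module):
    """Joint ASR (CTC) and fluency-scoring heads on a shared Wav2Vec2 encoder."""

    def __init__(self, vocab_size: int, lambda_fluency: float = 0.5):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        hidden = self.encoder.config.hidden_size
        self.ctc_head = nn.Linear(hidden, vocab_size)   # ASR branch
        self.fluency_head = nn.Linear(hidden, 1)        # utterance-level score branch
        self.lambda_fluency = lambda_fluency            # assumed loss weighting
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.mse = nn.MSELoss()

    def forward(self, input_values, targets=None, target_lengths=None,
                fluency_scores=None):
        hidden = self.encoder(input_values).last_hidden_state      # (B, T, H)
        ctc_logits = self.ctc_head(hidden)                         # per-frame tokens
        score = self.fluency_head(hidden.mean(dim=1)).squeeze(-1)  # pooled score

        loss = None
        if targets is not None and fluency_scores is not None:
            # CTCLoss expects (T, B, V) log-probabilities; for simplicity this
            # sketch assumes unpadded inputs, so every utterance spans all T frames.
            log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
            input_lengths = torch.full((input_values.size(0),), log_probs.size(0),
                                       dtype=torch.long)
            loss = (self.ctc_loss(log_probs, targets, input_lengths, target_lengths)
                    + self.lambda_fluency * self.mse(score, fluency_scores))
        return ctc_logits, score, loss


class AttentionFusion(nn.Module):
    """Cross-attention fusion of ASR text features and audio features (sketch).

    Assumes both feature streams have already been projected to a common
    dimension `dim`; text features attend over audio features, and the pooled
    result is mapped to a fluency score.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, text_feats, audio_feats):
        fused, _ = self.attn(text_feats, audio_feats, audio_feats)
        return self.scorer(fused.mean(dim=1)).squeeze(-1)
```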