Abstract

To address the low efficiency and poor robustness of single-modal speech emotion recognition, this paper employs a multimodal fusion mechanism that combines speech and visual information across modalities to build an audio-visual speech recognition (AVSR) system. The results show that the modal attention mechanism can automatically adjust to a more stable and reliable state according to the quality of each single-modality signal, making audio-visual multimodal perception accurate and recognition efficient.
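The abstract does not give implementation details, but the modal attention idea it describes can be illustrated with a minimal sketch: a scalar attention score is computed for each modality's feature vector, normalized with a softmax, and used to weight that modality's contribution to the fused representation, so a degraded signal receives less weight. All names and parameters below (modality_attention_fusion, w, b, the feature dimension) are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def modality_attention_fusion(audio_feat: np.ndarray,
                              visual_feat: np.ndarray,
                              w: np.ndarray,
                              b: float) -> np.ndarray:
    """Fuse audio and visual feature vectors via modality attention.

    A degraded modality (e.g. noisy audio) tends to receive a lower
    attention score, so the fused vector leans on the cleaner signal.
    `w` and `b` stand in for learned scoring parameters.
    """
    feats = np.stack([audio_feat, visual_feat])      # (2, d): one row per modality
    scores = feats @ w + b                           # (2,): one scalar score per modality
    alphas = np.exp(scores - scores.max())           # softmax over the two modalities
    alphas /= alphas.sum()
    return (alphas[:, None] * feats).sum(axis=0)     # attention-weighted fusion, shape (d,)

# Usage: 128-d features with random scoring parameters, for illustration only
rng = np.random.default_rng(0)
fused = modality_attention_fusion(rng.normal(size=128),
                                  rng.normal(size=128),
                                  rng.normal(size=128), 0.0)
```

In a trained system the scoring parameters would be learned jointly with the recognizer, which is what lets the attention weights track signal quality automatically.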
