Abstract

Accurate recognition of human intent is crucial for effective human–computer speech interaction. Many intent-understanding studies are based on speech-to-text transcription and often overlook paralinguistic cues (such as the speaker's emotion and attitude), leading to misunderstandings, especially when identical textual content conveys different intents through different paralinguistic information. Considering that interaction intent originates in the human brain, we propose a novel Multimodal Brain–Computer Fusion Network (MBCFNet) to discriminate the different intents carried by identical textual information, in which the acoustic-textual representation is adapted by brain functional information through a cross-modal transformer. In the output module, a joint multi-task learning method is used to optimize both the primary task of intent recognition and the auxiliary task of speaker emotion recognition. To evaluate model performance, we constructed a multimodal dataset, CMSLIU, consisting of acoustic, textual, and electroencephalogram (EEG) data for the same Chinese texts spoken with varying intents. Experimental results on this self-constructed dataset indicate that the proposed model achieves state-of-the-art (SOTA) performance compared with competing models. Ablation results reveal that performance declines when information from the EEG or audio modality is removed, and also when the auxiliary emotion recognition task is dropped. These results suggest the effectiveness of the proposed information fusion strategy for spoken language intent recognition and indicate that the brain–computer information fusion idea can be extended to other similar fields.
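
To make the fusion idea concrete, the following is a minimal sketch, not the authors' released implementation: it assumes PyTorch, and all module names, dimensions, the mean-pooled classification heads, and the auxiliary-loss weight are illustrative assumptions. It shows an acoustic-textual representation being adapted by EEG-derived features through cross-modal attention, followed by a joint loss over the primary intent task and the auxiliary emotion task, as described in the abstract.

# Minimal sketch of the described fusion strategy; names and sizes are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Adapt acoustic-textual tokens with EEG tokens via cross-attention,
    then predict intent (primary task) and emotion (auxiliary task)."""

    def __init__(self, d_model=256, n_heads=4, n_intents=4, n_emotions=4):
        super().__init__()
        # Cross-modal attention: acoustic-textual tokens query EEG tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.intent_head = nn.Linear(d_model, n_intents)    # primary task
        self.emotion_head = nn.Linear(d_model, n_emotions)  # auxiliary task

    def forward(self, at_repr, eeg_repr):
        # at_repr:  (B, T_at, d_model)  acoustic-textual token sequence
        # eeg_repr: (B, T_eeg, d_model) EEG-derived token sequence
        adapted, _ = self.cross_attn(query=at_repr, key=eeg_repr, value=eeg_repr)
        fused = self.norm(at_repr + adapted)   # residual adaptation of the representation
        pooled = fused.mean(dim=1)             # simple sequence pooling (assumption)
        return self.intent_head(pooled), self.emotion_head(pooled)

def multitask_loss(intent_logits, emotion_logits, intent_y, emotion_y, alpha=0.3):
    # Joint objective: intent loss plus a weighted auxiliary emotion loss.
    ce = nn.CrossEntropyLoss()
    return ce(intent_logits, intent_y) + alpha * ce(emotion_logits, emotion_y)

if __name__ == "__main__":
    model = CrossModalFusion()
    at = torch.randn(2, 50, 256)    # dummy acoustic-textual features
    eeg = torch.randn(2, 128, 256)  # dummy EEG features
    intent_logits, emotion_logits = model(at, eeg)
    loss = multitask_loss(intent_logits, emotion_logits,
                          torch.tensor([0, 1]), torch.tensor([2, 3]))
    loss.backward()
    print(loss.item())

In this sketch the acoustic-textual sequence acts as the attention query while the EEG sequence supplies keys and values, so the brain signal modulates rather than replaces the speech-derived representation; the auxiliary emotion head is trained jointly but only the intent output is used at inference.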
