Abstract

Computational modeling of human spoken language is an emerging research area in multimedia analysis spanning the textual and acoustic modalities. Multi-modal sentiment analysis is one of the most fundamental tasks in human spoken language understanding. In this paper, we propose a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis, with a focus on both the textual and acoustic modalities. Unlike the conventional soft attention mechanism, we employ a deep reinforcement learning mechanism to perform sentiment-relevant word selection and fully remove invalid words from each modality. Specifically, we first align the raw text and audio at the word level and extract independent handcrafted features for each modality to yield the textual and acoustic word sequences. Second, we establish two collaborative agents to handle the textual and acoustic modalities in spoken language, respectively. On this basis, we formulate the sentiment-relevant word selection process in a multi-modal setting as a multi-agent sequential decision problem and solve it with a multi-agent reinforcement learning approach. Detailed evaluations of multi-modal sentiment classification and emotion recognition on three benchmark datasets demonstrate the effectiveness of our approach over several competitive baselines.
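
To make the described setup more concrete, the sketch below illustrates one plausible reading of the abstract: one word-selection "agent" per modality samples binary keep/drop actions over word-aligned features, and a shared sentiment classifier provides a reward used in a simple policy-gradient (REINFORCE) update. This is not the authors' implementation; all module names, feature dimensions, and the reward definition are assumptions made only for illustration.

```python
# Hedged sketch of per-modality sentiment-relevant word selection with
# collaborative agents trained by policy gradient. Dimensions, reward,
# and pooling choices are assumptions, not the paper's specification.
import torch
import torch.nn as nn
from torch.distributions import Bernoulli


class WordSelectionAgent(nn.Module):
    """One agent per modality: scores each word-level feature and samples
    a binary keep/drop action for it."""
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.policy = nn.Linear(hidden_dim, 1)  # keep-probability logit per word

    def forward(self, feats):                            # feats: (B, T, feat_dim)
        states, _ = self.encoder(feats)                  # (B, T, hidden_dim)
        probs = torch.sigmoid(self.policy(states)).squeeze(-1)  # (B, T)
        dist = Bernoulli(probs)
        actions = dist.sample()                          # 1 = keep word, 0 = drop
        return actions, dist.log_prob(actions)


class MultimodalSentimentModel(nn.Module):
    """Pools the words each agent keeps and predicts a sentiment label."""
    def __init__(self, text_dim, audio_dim, num_classes=2):
        super().__init__()
        self.text_agent = WordSelectionAgent(text_dim)
        self.audio_agent = WordSelectionAgent(audio_dim)
        self.classifier = nn.Linear(text_dim + audio_dim, num_classes)

    def forward(self, text_feats, audio_feats):
        t_act, t_logp = self.text_agent(text_feats)
        a_act, a_logp = self.audio_agent(audio_feats)
        # Mask out dropped words, then mean-pool the survivors per modality.
        t_pooled = (text_feats * t_act.unsqueeze(-1)).sum(1) / t_act.sum(1, keepdim=True).clamp(min=1)
        a_pooled = (audio_feats * a_act.unsqueeze(-1)).sum(1) / a_act.sum(1, keepdim=True).clamp(min=1)
        logits = self.classifier(torch.cat([t_pooled, a_pooled], dim=-1))
        return logits, t_logp + a_logp


# Toy usage: classification correctness serves as an assumed shared reward,
# and a REINFORCE loss pushes both agents toward selections that help the classifier.
model = MultimodalSentimentModel(text_dim=300, audio_dim=74)
text = torch.randn(8, 20, 300)    # 8 utterances, 20 aligned words, GloVe-like textual features
audio = torch.randn(8, 20, 74)    # matching word-level acoustic features (assumed dimensionality)
labels = torch.randint(0, 2, (8,))

logits, log_probs = model(text, audio)
ce_loss = nn.functional.cross_entropy(logits, labels)
reward = (logits.argmax(-1) == labels).float() - 0.5     # assumed reward: centered correctness
policy_loss = -(reward.unsqueeze(-1) * log_probs).mean()
(ce_loss + policy_loss).backward()
```

Because the sampled keep/drop actions are discrete and non-differentiable, the classifier loss alone cannot train the agents; the policy-gradient term is what couples the selection decisions of both modalities to the downstream sentiment objective, which is the sequential-decision framing the abstract describes.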
