Abstract

Speech sentiments are inherently dynamic. However, most existing methods for speech sentiment analysis follow the paradigm of assigning a single hard label to an entire utterance, which makes segment-level fine-grained sentiment information unavailable and leaves sentiment dynamics unmodeled. In addition, due to the ambiguity of sentiments, annotating an utterance at the segment level is difficult and time-consuming. In this work, to alleviate these issues, we propose to use sentiment profiles (SPs), which provide a time series of segment-level soft labels to capture fine-grained sentiment cues across an utterance. To obtain a large amount of data with segment-level annotations, we propose a segment-level cross-modal knowledge transfer method that transfers facial expression knowledge from images to audio segments. Furthermore, we propose the sentiment profile refinery (SPR), which iteratively updates sentiment profiles to improve their accuracy and mitigate the label noise introduced by cross-modal knowledge transfer. Our experiments on the CH-SIMS dataset, with the iQIYI-VID dataset as unlabeled data, show that our method can effectively exploit additional unlabeled audio-visual data and achieves state-of-the-art performance.
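The sketch below is an illustrative reading of the two ideas named in the abstract, not the authors' released code: a sentiment profile is treated as a matrix of segment-level soft labels, initialized from a visual (facial-expression) teacher, and the refinery step is approximated as a simple blend of the current profile with the audio model's own segment-level predictions. All function names, shapes, and the blending rule are assumptions made for illustration.

import numpy as np

def init_sentiment_profile(teacher_probs: np.ndarray) -> np.ndarray:
    """Segment-level soft labels transferred from the visual teacher.

    teacher_probs: (num_segments, num_classes), each row a probability vector.
    """
    return teacher_probs.copy()

def refine_profile(sp: np.ndarray,
                   student_probs: np.ndarray,
                   alpha: float = 0.7) -> np.ndarray:
    """One assumed refinery step: mix the current profile with the audio
    model's segment-level predictions, then renormalize each row."""
    mixed = alpha * sp + (1.0 - alpha) * student_probs
    return mixed / mixed.sum(axis=1, keepdims=True)

# Toy usage: 4 audio segments, 3 sentiment classes (negative / neutral / positive).
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.3, 0.2],
                    [0.2, 0.3, 0.5],
                    [0.1, 0.2, 0.7]])
sp = init_sentiment_profile(teacher)
for _ in range(3):                                      # a few refinery iterations
    student = np.random.dirichlet(np.ones(3), size=4)   # stand-in audio predictions
    sp = refine_profile(sp, student)
print(sp)                                               # refined segment-level soft labels

In the paper's setting the student predictions would come from the audio sentiment model retrained on the current profiles, so each refinery round both denoises the transferred labels and improves the model that produces the next round's predictions.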
