Abstract

In speech sentiment analysis, a single hard label traditionally determines the sentiment of an entire utterance, which ignores the inherent dynamics and ambiguity of speech sentiment. Moreover, most existing sentiment corpora contain few segment-level ground-truth labels, owing to label ambiguity and annotation cost. In this work, to capture segment-level sentiment fluctuations within an utterance, we propose sentiment profiles (SPs), which express sentiment as segment-level soft labels. To alleviate the data shortage, we introduce large-scale multi-modal in-the-wild video data, and facial expression knowledge is used to guide the generation of soft labels for audio segments through a Cross-modal Sentiment Annotation Module. We then design a Speech Encoder Module that encodes audio segments into SPs, and further exploit a sentiment profile purifier (SPP) to iteratively improve the accuracy of the SPs. Extensive experiments show that our model, trained with unlabeled data, achieves state-of-the-art performance on the CH-SIMS and IEMOCAP datasets.
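To make the notion of a sentiment profile concrete, the following is a minimal sketch of how segment-level soft labels might be represented and iteratively refined. The three-class label set, the function names, and the convex blending rule in `purify_profile` are illustrative assumptions only, not the paper's actual modules or SPP update rule.

```python
import numpy as np

# Hypothetical label set; the paper's exact sentiment classes are not specified here.
CLASSES = ["negative", "neutral", "positive"]

def make_sentiment_profile(segment_logits: np.ndarray) -> np.ndarray:
    """Turn per-segment logits (T x C) into a sentiment profile:
    one soft-label distribution over classes for each audio segment."""
    e = np.exp(segment_logits - segment_logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def purify_profile(profile: np.ndarray, model_preds: np.ndarray,
                   alpha: float = 0.7) -> np.ndarray:
    """One illustrative purification step: blend the current profile with
    fresh model predictions (an assumed stand-in for the paper's SPP),
    then renormalize for numerical safety."""
    updated = alpha * profile + (1.0 - alpha) * model_preds
    return updated / updated.sum(axis=1, keepdims=True)

# Toy example: an utterance split into 4 segments, 3 sentiment classes.
rng = np.random.default_rng(0)
sp = make_sentiment_profile(rng.normal(size=(4, len(CLASSES))))
new_preds = make_sentiment_profile(rng.normal(size=(4, len(CLASSES))))
sp = purify_profile(sp, new_preds)
print(sp.round(3))  # each row sums to 1: a soft label per segment
```

In this reading, an SP is simply a T x C matrix of per-segment class distributions, which is what lets it capture sentiment fluctuations that a single utterance-level hard label cannot.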
