Abstract
This paper introduces a framework for constructing discriminative features for speech emotion recognition by jointly learning discrete categorical and continuous emotion information. In the discrete emotion labeling approach, each utterance is assigned a single label, whereas in continuous emotion labeling, three primary attribute values (arousal, valence, and dominance) are assigned to each utterance. Each auxiliary task (arousal, valence, and dominance) is discretized into low, mid, and high categories and predicted simultaneously with the main task (discrete emotion prediction). A deep CNN architecture is proposed to optimize this multi-label objective and is later used to extract intermediate features. The extracted features are then used to train a deep neural network that classifies the discrete emotion class. The proposed network is evaluated on the IEMOCAP dataset for four emotions: angry, excited, neutral, and sad. The proposed multi-label framework improves unweighted accuracy (UWA) by 3.0% compared with the single-label framework (discrete emotion prediction).
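The abstract describes a shared CNN encoder with a main head for the four discrete emotions and three auxiliary heads for the low/mid/high arousal, valence, and dominance categories. The sketch below illustrates this multi-task setup under stated assumptions: it is not the authors' exact architecture, and all layer sizes, names (MultiTaskSER, multitask_loss), the log-Mel spectrogram input, and the auxiliary loss weight are illustrative.

```python
# Minimal sketch of joint discrete-emotion and attribute-category prediction.
# Assumptions: input is a (batch, 1, mels, frames) log-Mel spectrogram; the
# architecture and loss weighting are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, n_emotions: int = 4, n_levels: int = 3):
        super().__init__()
        # Shared convolutional feature extractor.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Shared intermediate feature layer; in the paper's pipeline such
        # intermediate features are later reused to train a separate DNN classifier.
        self.shared = nn.Linear(64, 128)
        self.emotion_head = nn.Linear(128, n_emotions)    # main task: discrete emotion
        self.arousal_head = nn.Linear(128, n_levels)      # auxiliary: low/mid/high
        self.valence_head = nn.Linear(128, n_levels)      # auxiliary: low/mid/high
        self.dominance_head = nn.Linear(128, n_levels)    # auxiliary: low/mid/high

    def forward(self, x):
        h = torch.relu(self.shared(self.encoder(x)))
        return (self.emotion_head(h), self.arousal_head(h),
                self.valence_head(h), self.dominance_head(h), h)

def multitask_loss(outputs, targets, aux_weight: float = 0.3):
    """Cross-entropy on the main task plus weighted cross-entropies on the auxiliaries."""
    emo, aro, val, dom, _ = outputs
    y_emo, y_aro, y_val, y_dom = targets
    ce = nn.functional.cross_entropy
    return ce(emo, y_emo) + aux_weight * (ce(aro, y_aro) + ce(val, y_val) + ce(dom, y_dom))
```

The design choice reflected here is that the auxiliary attribute heads regularize the shared representation, so the intermediate features (the fifth output of `forward`) carry both categorical and dimensional emotion information before being fed to the downstream classifier.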