Abstract

Speech emotion recognition is challenging because of the affective gap between subjective emotions and low-level acoustic features. By integrating multilevel feature learning and model training, deep convolutional neural networks (DCNNs) have exhibited remarkable success in bridging the semantic gap in visual tasks such as image classification and object detection. This paper explores how to utilize a DCNN to bridge the affective gap in speech signals. To this end, we first extract three channels of log Mel-spectrograms (static, delta, and delta-delta), analogous to the red, green, and blue (RGB) channels of an image, as the DCNN input. Then, the AlexNet DCNN model pretrained on the large ImageNet dataset is employed to learn high-level feature representations on each segment partitioned from an utterance. The learned segment-level features are aggregated by a discriminant temporal pyramid matching (DTPM) strategy, which combines temporal pyramid matching and optimal Lp-norm pooling to form a global utterance-level feature representation, followed by a linear support vector machine (SVM) for emotion classification. Experimental results on four public datasets, namely EMO-DB, RML, eNTERFACE05, and BAUM-1s, show the promising performance of our DCNN model and the DTPM strategy. Another interesting finding is that the DCNN model pretrained for image applications performs reasonably well in affective speech feature extraction. Further fine-tuning on the target emotional speech datasets substantially improves recognition performance.
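To make the input representation concrete, the sketch below illustrates how the three-channel log Mel-spectrogram (static, delta, and delta-delta) described above could be computed with librosa. It is an illustrative approximation, not the authors' code; the sampling rate, number of Mel bands, and frame/hop lengths are assumptions, not values taken from the paper.

```python
# Illustrative sketch: three-channel log Mel-spectrogram input (static, delta,
# delta-delta), stacked like the RGB channels of an image. Parameter values
# (sr, n_mels, n_fft, hop_length) are assumptions, not the paper's settings.
import numpy as np
import librosa

def three_channel_logmel(wav_path, sr=16000, n_mels=64, n_fft=400, hop_length=160):
    """Return an array of shape (n_mels, frames, 3)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    log_mel = librosa.power_to_db(mel)               # static channel
    delta = librosa.feature.delta(log_mel, order=1)  # first temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2) # second temporal derivative
    return np.stack([log_mel, delta, delta2], axis=-1)
```

The resulting three-channel "image" can then be split into fixed-length segments along the time axis before being fed to the pretrained DCNN.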
