Abstract

This study aims to learn deep features from speech for the emotion recognition task using a less complex architecture with fewer learnable parameters. We propose a simple convolutional neural network (CNN) architecture that operates on log-mel spectrograms of segmented speech utterances. The proposed architecture extracts emotion-related features from two widely used speech emotion recognition databases: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin Database of Emotional Speech (EmoDB). Extensive experiments on these datasets demonstrate the performance of the proposed model, and the results are compared with recent CNN architectures. In speaker-independent evaluation, the proposed CNN achieves classification accuracies of 59.33% and 65.47% on the full and improvised IEMOCAP utterances, respectively, for four emotion classes, and 72.02% on the Berlin EmoDB database for seven classes.
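The abstract does not specify the network configuration, so the following is a minimal sketch of the pipeline it describes: fixed-length segments of an utterance are converted to log-mel spectrograms and classified by a small CNN. The segment length, mel-band count, filter counts, and the choice of librosa and PyTorch here are illustrative assumptions, not the paper's reported setup.

```python
# Minimal sketch: log-mel spectrograms of fixed-length speech segments
# fed to a small CNN. All hyperparameters (n_mels, segment length,
# filter counts, num_classes) are illustrative assumptions, not the
# configuration reported in the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_segments(wav_path, sr=16000, n_mels=64, seg_frames=128):
    """Split one utterance into fixed-size log-mel spectrogram segments."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
    # Drop the trailing partial segment for simplicity.
    n_segs = log_mel.shape[1] // seg_frames
    return [log_mel[:, i * seg_frames:(i + 1) * seg_frames]
            for i in range(n_segs)]

class SimpleEmotionCNN(nn.Module):
    """A deliberately small CNN: two conv blocks and one linear classifier."""
    def __init__(self, n_mels=64, seg_frames=128, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Two 2x2 poolings shrink each spatial dimension by a factor of 4.
        self.classifier = nn.Linear(32 * (n_mels // 4) * (seg_frames // 4),
                                    num_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, seg_frames)
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Hypothetical usage: classify one utterance by averaging segment scores.
# segs = log_mel_segments("speech.wav")
# model = SimpleEmotionCNN()
# batch = torch.tensor(np.stack(segs), dtype=torch.float32).unsqueeze(1)
# probs = model(batch).softmax(dim=1).mean(dim=0)
```

Because the utterance is segmented, segment-level predictions would typically be aggregated (for example, averaged as above) to produce a single utterance-level emotion label.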
