Abstract

Detecting human intentions and emotions helps improve human–robot interaction. Emotion recognition has been a challenging research direction over the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of equal length. Next, for each frame we extract an 88-dimensional vector of audio features, including Mel-Frequency Cepstral Coefficients (MFCC), pitch, and intensity. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, we apply k-means clustering to the extracted features of all frames of each audio signal and select the k most discriminant frames, called keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to the keyframes is then stacked into a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network (CNN) using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to state-of-the-art methods reported in the literature.
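
The following is a minimal sketch of how the preprocessing pipeline described above could be implemented with librosa and scikit-learn. The frame length, hop size, value of k, and the small per-frame descriptor are illustrative placeholders (the paper's actual 88-dimensional feature set is not reproduced here), and the rule of taking the frame nearest each cluster centre as a keyframe is one plausible reading of the abstract, not the authors' confirmed procedure.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans


def split_frames(y, frame_len, hop_len):
    """Split a mono signal into overlapping, equal-length frames."""
    n = 1 + (len(y) - frame_len) // hop_len
    return np.stack([y[i * hop_len: i * hop_len + frame_len] for i in range(n)])


def frame_descriptor(frame, sr):
    """Per-frame features (MFCC means, mean pitch, RMS energy).
    Stand-in for the paper's 88-dimensional vector."""
    mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13).mean(axis=1)
    f0 = librosa.yin(frame, fmin=65, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=frame).mean()
    return np.concatenate([mfcc, [np.nanmean(f0), rms]])


def log_spectrogram(frame, n_fft=512, hop=128):
    """Log-magnitude spectrogram of a single frame."""
    s = np.abs(librosa.stft(frame, n_fft=n_fft, hop_length=hop))
    return librosa.amplitude_to_db(s, ref=np.max)


def keyframe_tensor(y, sr, k=9, frame_len=8000, hop_len=4000):
    """Pick k keyframes via k-means on frame features and stack their
    spectrograms into a (k, freq_bins, time_steps) tensor."""
    frames = split_frames(y, frame_len, hop_len)
    feats = np.stack([frame_descriptor(f, sr) for f in frames])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    # Assumed keyframe rule: the frame closest to each cluster centre.
    key_idx = sorted(int(np.argmin(np.linalg.norm(feats - c, axis=1)))
                     for c in km.cluster_centers_)
    return np.stack([log_spectrogram(frames[i]) for i in key_idx])
```

A correspondingly small PyTorch sketch of the classifier follows; it mirrors the stated architecture (two convolutional layers and one fully connected layer), but the channel widths, kernel sizes, and input dimensions are guesses rather than the paper's values.

```python
import torch
import torch.nn as nn


class Emotion3DCNN(nn.Module):
    """Two 3D convolutional layers followed by one fully connected layer;
    hyperparameters are illustrative, not taken from the paper."""

    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        # LazyLinear infers its input size on the first forward pass,
        # so the spectrogram dimensions need not be hard-coded.
        self.classifier = nn.LazyLinear(n_classes)

    def forward(self, x):  # x: (batch, 1, k, freq_bins, time_steps)
        return self.classifier(self.features(x).flatten(1))


# Example: a batch of 4 tensors with k = 9 keyframes of 257x63 spectrograms,
# classified into 6 emotion categories (all numbers illustrative).
model = Emotion3DCNN(n_classes=6)
logits = model(torch.randn(4, 1, 9, 257, 63))  # -> shape (4, 6)
```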

Highlights

  • Designing an accurate automatic emotion recognition (ER) system is crucial and beneficial to the development of many applications, such as human–computer interaction (HCI) applications [1], computer-aided diagnosis systems, and deceit-analysis systems

  • Taking the acquisition source of the data into account, emotional databases fall into three general groups: spontaneous emotions, emotions elicited by invocation, and simulated (acted) emotions

  • Databases recorded in natural situations, such as TV shows or movies, fall under the first group

Introduction

Designing an accurate automatic emotion recognition (ER) system is crucial and beneficial to the development of many applications, such as human–computer interaction (HCI) applications [1], computer-aided diagnosis systems, and deceit-analysis systems. Three main modalities are used for this purpose, namely acoustic, visual, and gestural. Speech emotion recognition (SER) is useful for addressing HCI problems, provided that it can overcome challenges such as understanding the true emotional state behind spoken words. In this context, SER can be used to improve human–machine interaction by interpreting human speech. SER refers to the field of extracting semantics from speech signals. Applications such as pain and lie detection, computer-based tutoring systems, and movie or music recommendation systems that rely on the emotional state of the user can benefit from such an automatic system. The main goal of SER is to detect the discriminative features of a speaker’s voice in different emotional situations.
