Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features.

Tursunov Anvarjon, Mustaqeem Mustaqeem, Soonil Kwon

doi:10.3390/s20185212

Open Access

Journal: Sensors (Basel, Switzerland)	Publication Date: Sep 12, 2020
Citations: 104	License type: CC BY 4.0

Affiliation: Sejong University

Abstract

Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. A speech emotion recognition (SER) system evaluates the emotional state of a speaker by analyzing his or her speech signal. Emotion recognition is a challenging task for a machine, and making it efficient enough for practical AI systems is equally challenging. The speech signal is hard to examine with signal processing methods because it consists of different frequencies and features that vary with emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Although various algorithms have been developed for SER, success rates remain low and vary with the language, the emotions, and the database. In this paper, we propose a new lightweight and effective SER model that has low computational complexity and high recognition accuracy. The proposed method uses a convolutional neural network (CNN) to learn deep frequency features, using a plain rectangular filter with a modified pooling strategy; these features have more discriminative power for SER. The CNN model was trained on frequency features extracted from the speech data and was then tested to predict emotions. The model was evaluated on two benchmark speech datasets, the interactive emotional dyadic motion capture (IEMOCAP) dataset and the Berlin emotional speech database (EMO-DB), and obtained recognition rates of 77.01% and 92.02%, respectively. The experimental results demonstrate that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.
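To make the idea of "frequency features extracted from the speech data" concrete, the following is a minimal sketch of a magnitude spectrogram computed with a short-time discrete Fourier transform. It is an illustration only, not the authors' pipeline: the function names, frame length, hop size, Hann window, and the synthetic test tone are all assumptions chosen for readability, and a practical system would use an FFT library rather than the naive transform shown here.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram indexed as [frequency bin][time frame].

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(n_bins):
            # Naive DFT, O(frame_len^2); real code would call an FFT.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    # Transpose from [time][frequency] to [frequency][time].
    return [list(col) for col in zip(*frames)]


# Synthetic stand-in for speech: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(1024)]

spec = spectrogram(signal)
# Each bin spans sample_rate / frame_len = 125 Hz, so the tone falls in bin 8.
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
```

The resulting two-dimensional array (frequency by time) is the kind of image-like input that a CNN can consume; with real speech, the energy pattern across frequency bins changes over time instead of staying in a single bin.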

Highlights

The affective content analysis of speech signals is an active area of investigation.
We propose a simple and lightweight convolutional neural network (CNN) architecture with multiple layers that uses modified kernels and pooling strategy to detect sensitive cues by extracting deep frequency features from speech spectrograms; these features tend to be more discriminative and reliable for speech emotion recognition.
We validate our system in practice by testing it on two benchmarks.
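As a rough illustration of how a rectangular kernel and a frequency-oriented pooling window act on a spectrogram, the sketch below applies a single hand-written filter to a toy input. Everything specific here is an assumption made for the example: the 3x1 kernel shape and weights, the 2x1 pooling window, and the toy data are hypothetical and are not the layer configuration reported by the authors, whose kernels are learned during training.

```python
def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation (no padding, stride 1), as CNN layers compute."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, pool_h, pool_w):
    """Non-overlapping max pooling with a (pool_h, pool_w) window."""
    out_h = len(fmap) // pool_h
    out_w = len(fmap[0]) // pool_w
    return [
        [
            max(
                fmap[r * pool_h + i][c * pool_w + j]
                for i in range(pool_h)
                for j in range(pool_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy spectrogram: 8 frequency bins (rows) x 6 time frames (columns),
# with all energy concentrated in frequency bin 3.
spec = [[1.0 if f == 3 else 0.0 for _ in range(6)] for f in range(8)]

# Hypothetical 3x1 kernel: three bins tall, one frame wide. It responds
# to a peak along the frequency axis and ignores temporal context.
kernel = [[-1.0], [2.0], [-1.0]]

features = relu(conv2d_valid(spec, kernel))      # shape 6 x 6
pooled = max_pool(features, pool_h=2, pool_w=1)  # pool along frequency only
```

Because both the kernel and the pooling window extend only along the frequency axis, the time resolution of the feature map is preserved while the frequency axis is summarized, which is one plausible reading of why rectangular shapes suit frequency features.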

Summary

Introduction

The affective content analysis of speech signals is an active area of investigation. Speech is the most prevalent way for human beings to exchange information, which makes it worth attention in human-computer interaction (HCI). Emotion is the most significant factor in human speech, and it can be analyzed to make judgments about human expressions, paralanguage, and other cues. The speech signal is an efficient and fast channel of communication in HCI, through which human behavior can be recognized efficiently. Emotion recognition from speech signals is one of the fastest emerging research fields, in which researchers have developed methods to detect emotions naturally from a speech signal [1,2]. Speech emotion recognition (SER) is beneficial for education and health, and it is expected to be widely used in these fields as practical systems are proposed [3].
