Abstract

Speech Emotion Recognition (SER) plays a significant role in Human–Computer Interaction (HCI) and has a wide range of applications. However, two issues still hinder practical deployment: emotional expression differs across individuals, and some hard-to-distinguish emotions can reduce the stability of an SER system. In this paper, we propose a multi-layer hybrid fuzzy support vector machine (MLHF-SVM) model with three layers: a feature extraction layer, a pre-classification layer, and a classification layer. The MLHF-SVM model addresses these two issues with, respectively, fuzzy c-means (FCM) clustering based on speaker identification information and multi-layer SVM classifiers. In addition, to overcome the tendency of FCM to fall into local minima, an improved natural exponential inertia weight particle swarm optimization (IEPSO) algorithm is proposed and integrated with FCM for optimization. Moreover, in the feature extraction layer, non-personalized and personalized features are combined to improve accuracy. To verify the effectiveness of the proposed model, all emotions in three popular datasets are used for simulation. The results show that the model effectively improves classification accuracy, with a maximum single-emotion recognition rate of 97.67% on the EmoDB dataset.
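The three-layer structure described above can be sketched in code: features are clustered into two subsets by FCM in the pre-classification layer, and a separate SVM is then trained per subset in the classification layer. The FCM implementation, the toy 2-D "speech features", and all hyperparameters below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the three-layer idea: FCM pre-classification into
# C = 2 subsets, then one SVM per subset.
import numpy as np
from sklearn.svm import SVC

def fcm(X, c=2, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: returns membership matrix U and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = U ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = d ** (-2.0 / (m - 1))          # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

# Toy features: two gender-like groups, each containing two emotion classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, (50, 2))
               for loc in ([0, 0], [0, 1.5], [5, 0], [5, 1.5])])
y = np.array(([0] * 50 + [1] * 50) * 2)    # emotion labels

# Pre-classification layer: hard-assign each sample to its strongest cluster.
U, centers = fcm(X, c=2)
clusters = U.argmax(axis=1)

# Classification layer: one SVM trained per cluster.
svms = {k: SVC().fit(X[clusters == k], y[clusters == k]) for k in (0, 1)}

def predict(x):
    """Route a sample to its nearest fuzzy cluster, then apply that SVM."""
    k = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    return int(svms[k].predict(x[None, :])[0])
```

Splitting the data before classification means each SVM only has to separate emotions within one speaker group, which is the mechanism the abstract credits for handling inter-individual variation.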

Highlights

  • In the pre-classification stage, the data generated by the feature extraction layer were clustered into two subsets according to gender (i.e., C = 2) by fuzzy c-means (FCM) optimized with the improved natural exponential inertia weight particle swarm optimization (IEPSO) algorithm

  • A multi-layer hybrid fuzzy support vector machine (MLHF-SVM) model based on clustering and classification was proposed for speech emotion classification
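The IEPSO component mentioned above decays the PSO inertia weight on a natural exponential schedule. The exact IEPSO formula is not reproduced on this page, so the decay w(t) = w_min + (w_max − w_min)·exp(−k·t/T) used below is a common assumed form, and the sphere-function fitness stands in for the FCM objective that IEPSO would actually optimize.

```python
# Sketch of PSO with an exponentially decaying inertia weight (assumed
# form; not the paper's exact IEPSO update).
import numpy as np

def pso(fitness, dim, n_particles=30, T=200, w_max=0.9, w_min=0.4,
        c1=2.0, c2=2.0, k=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.apply_along_axis(fitness, 1, x)
    gbest = pbest[pbest_f.argmin()].copy()
    for t in range(T):
        # exponential inertia decay: explore early (w_max), exploit late (w_min)
        w = w_min + (w_max - w_min) * np.exp(-k * t / T)
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        f = np.apply_along_axis(fitness, 1, x)
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

# Toy fitness: sphere function with minimum at (1, 1, 1).
best, best_f = pso(lambda z: np.sum((z - 1.0) ** 2), dim=3)
```

A decaying inertia weight is a standard remedy for premature convergence: a large w early keeps particles exploring the search space, while a small w late lets the swarm refine the best candidate cluster centers, which is how it helps FCM escape local minima.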


Introduction

With the gradual enrichment of material life, people’s attention has gradually shifted from the physical world to the spiritual world [1]. In the field of HCI, a computer can recognize emotions through gestures, audio signals, body poses, facial expressions, physiological signals, neuroimaging methods, and so on. Among these modalities, speech is the most natural and fastest way for speakers to convey emotions, through intonation, volume, and speed. SER has therefore focused on achieving more natural interaction between people and machines and has become deeply involved in a wide range of real-life applications, such as public safety [7], diagnosis of psychiatric diseases [8], adjustment of driving behavior based on the driver’s state [9], web games, and emergency call centers [10].

