Abstract

Identifying speech emotions in spontaneous databases has recently become a complex and demanding research area. This research presents a new approach for recognizing semi-natural and spontaneous speech emotions using multiple-feature fusion and deep neural networks (DNNs). The proposed framework extracts the most discriminative features from hybrid acoustic feature sets. However, these feature sets may contain duplicate and irrelevant information, leading to inadequate emotion identification. Therefore, a support vector machine (SVM) is used to identify the most discriminative audio feature map after the relevant features are learned by the fusion approach. We evaluated our approach on the eNTERFACE05 and BAUM-1s benchmark databases and observed identification accuracies of 76% and 59%, respectively, in a speaker-independent experiment with the SVM. Furthermore, the experiments on eNTERFACE05 and BAUM-1s indicate that the suggested framework outperforms current state-of-the-art techniques on semi-natural and spontaneous datasets.
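The pipeline described above (hybrid acoustic features, a fusion network that learns a compact representation, and an SVM classifier on top) can be illustrated in code. The sketch below is a minimal, hypothetical illustration rather than the authors' implementation: it assumes librosa for the hybrid acoustic feature sets, a small PyTorch multilayer perceptron as a stand-in for the fusion network, and scikit-learn's SVC as the final classifier. All layer sizes, feature choices, and hyperparameters are illustrative assumptions.

import numpy as np
import librosa
import torch
import torch.nn as nn
from sklearn.svm import SVC

def hybrid_features(path, sr=16000):
    """Concatenate several acoustic feature sets into one hybrid vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # spectral envelope
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)             # pitch-class energy
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)  # mel-band energy
    # Mean-pool each feature set over time, then fuse by concatenation
    # (13 + 12 + 40 = 65 dimensions in this sketch).
    return np.concatenate([f.mean(axis=1) for f in (mfcc, chroma, mel)])

class FusionNet(nn.Module):
    """MLP that maps the hybrid vector to a compact, discriminative embedding."""
    def __init__(self, in_dim=65, emb_dim=64, n_classes=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim), nn.ReLU())
        self.head = nn.Linear(emb_dim, n_classes)  # used only during training

    def forward(self, x):
        return self.head(self.encoder(x))

# After training FusionNet with cross-entropy on the emotion labels, the
# learned embeddings (not the softmax head) are fed to the SVM, mirroring
# the two-stage design in the abstract:
#   X_emb = net.encoder(torch.tensor(X, dtype=torch.float32)).detach().numpy()
#   clf = SVC(kernel="rbf").fit(X_emb, y_labels)

The key design choice this sketch captures is that the DNN serves as a feature learner while a margin-based classifier makes the final decision, which is a common remedy when fused feature sets contain redundant or irrelevant dimensions.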

Highlights

  • We propose a speech emotion recognition (SER) framework that addresses the issue of diverse acoustic characteristics, which typically degrade the identification performance of emotion classification systems

  • The unified and improved features, rather than the various heterogeneous characteristics, are fed into the fusion network module for the recognition task

  • Experimental findings on semi-natural and spontaneous datasets show that the proposed architecture performs well

Introduction

As a means of expressing emotion, the speech signal plays a significant role in human communication. Speech has attracted the interest of several organizations working in the domain of human-computer interaction (HCI) [1,2]. In the framework of HCI, if a machine can identify human emotional states from dialogue, it can adapt its actions to interact effectively with a particular individual.
