Abstract

Speech is an auditory signal produced by the human speech production system and used to express ourselves. Speech signals are now also used in biometric identification technologies and in interaction with machines, so that a machine can respond differently to different users. Emotion recognition is not a new topic; research and applications already exist that use different methods to extract specific features from speech signals. This paper presents a classification analysis of emotional human speech using only short-term processing features of the speech signal and an artificial neural network based approach. Speech rate, pitch, and energy are the most basic features of a speech signal, yet they still show significant differences between emotions such as anger and sadness. The most common way to analyze speech emotion is to extract important features that are related to different emotion states from the voice signal. In the speech pre-processing phase, samples of four basic types of emotional speech, sad, angry, happy, and neutral, are used. The extracted short-term features are then fed to the input of the classifier, and the different emotions are obtained at the output. Twenty-three short-term audio signal features are selected and extracted from two frames of each of the 40 speech samples to analyze the human emotions. These derived data, together with their corresponding emotion target matrix, are used to design and test the classifier with an artificial neural network pattern recognition algorithm. A confusion matrix is generated to analyze the performance. The overall correct classification rate for the network trained twice is 73.8%, while increasing the number of training runs to ten raises it to 95%; the accuracy of the neural network system is thus improved by multiple rounds of training. The overall system provides reliable performance and correctly classifies more than 85% of a new, untrained dataset.
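
A minimal sketch of the classification stage described above, assuming scikit-learn's MLPClassifier as a stand-in for the neural-network pattern-recognition classifier and random feature vectors in place of the 23 short-term features extracted from the LDC recordings; the network size, training settings, and synthetic labels are illustrative assumptions, not the authors' configuration.

```python
# Illustrative sketch of the classification stage only: feature vectors go
# into a small feed-forward network and a confusion matrix summarises the
# per-emotion results. The random 23-dimensional vectors below stand in for
# the short-term features the paper extracts from the LDC recordings.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

EMOTIONS = ["sad", "angry", "happy", "neutral"]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 23))              # 40 samples x 23 features (synthetic)
y = rng.integers(0, len(EMOTIONS), size=40)    # synthetic emotion labels

# Stand-in for the ANN pattern-recognition classifier (hidden size is assumed).
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, y)

print(confusion_matrix(y, clf.predict(X)))     # rows: true emotion, cols: predicted
```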

Highlights

  • In human interaction, emotions play an important role

  • The most common way to recognize speech emotion is to first extract important features related to different emotion states from the voice signal (e.g., energy is an important feature for distinguishing happy from sad speech), feed those features to the input of a classifier, and obtain the different emotions at the output

  • Samples of recorded English speech signals of four emotions are taken from the Emotional Prosody Speech and Transcripts corpus of the Linguistic Data Consortium (LDC), in which actors and actresses perform the different emotions

Introduction

Human beings possess and express emotions in everyday interactions with others. There are different kinds of signs that indicate emotions. In human–human communication, emotions can be expressed verbally or through facial expressions. Speech signals carry different types of information, including the message itself, the speaker's identity, the speaker's emotional state, the language being spoken, and so on. One important aspect of human–computer interaction is training the system to understand human emotions through voice. People can use their voice to issue commands to many electrical devices such as cars, smart phones, and computers, and making these devices understand human emotions gives a better interaction experience. The most common way to recognize speech emotion is to first extract important features related to different emotion states from the voice signal (e.g., energy is an important feature for distinguishing happy from sad speech), feed those features to the input of a classifier, and obtain the different emotions at the output, as sketched below.

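The following minimal sketch, assuming plain NumPy and a fixed frame length of 400 samples with 50% overlap (values not taken from the paper), illustrates the short-term energy feature mentioned above: louder, higher-arousal speech (e.g. happy or angry) yields a larger mean frame energy than quieter, low-arousal speech (e.g. sad).

```python
# Minimal sketch: short-term (frame) energy as a cue that separates
# high-arousal emotions (happy/angry) from low-arousal ones (sad).
# Frame length, hop size, and the synthetic signals are illustrative
# assumptions, not values or data from the paper.
import numpy as np

def frame_energies(signal, frame_len=400, hop=200):
    """Average energy of each short-term frame of a speech signal."""
    return np.array([
        np.sum(signal[i:i + frame_len] ** 2) / frame_len
        for i in range(0, len(signal) - frame_len + 1, hop)
    ])

# Synthetic stand-ins: a louder and a quieter sinusoid of equal length.
t = np.linspace(0, 1, 8000)
loud = 0.8 * np.sin(2 * np.pi * 200 * t)    # stand-in for high-arousal speech
quiet = 0.1 * np.sin(2 * np.pi * 200 * t)   # stand-in for low-arousal speech

print("mean frame energy (loud): ", frame_energies(loud).mean())
print("mean frame energy (quiet):", frame_energies(quiet).mean())
```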