Evaluation of influence of spectral and prosodic features on GMM classification of Czech and Slovak emotional speech

Anna Přibilová, Jiří Přibil

doi:10.1186/1687-4722-2013-8

Abstract

This article analyzes and compares the influence of different types of spectral and prosodic features on Czech and Slovak emotional speech classification based on Gaussian mixture models (GMM). The influence of the initial parameter settings for the GMM training process (the number of mixture components and the number of iterations) was also analyzed. Subsequently, an analysis was performed to find how the correctness of emotion classification depends on the number and order of the parameters in the input feature vector and on the computational complexity. Another test was carried out to verify the functionality of the proposed two-level architecture comprising a gender recognizer and an emotional speech classifier. Further tests examined how certain adverse conditions (an input speech signal of too short duration, an incorrectly determined speaker gender, etc.) affect the stability of the results generated during the GMM classification process. Evaluations and tests were carried out on speech material in the form of sentences by male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in the Czech and Slovak languages. In addition, a comparative experiment using a speech corpus in another language (German) was performed. The mean classification error rate of the whole classifier structure is about 21% for all four emotions and both genders, and the best error rate obtained was 3.5% for the sadness style of the female gender. These values are acceptable in this first stage of development of the GMM classifier. On the other hand, the tests showed the principal importance of correct classification of the speaker gender in the first level, which heavily influences the resulting recognition score of the emotion classification. This GMM classifier is intended for evaluating the quality of synthetic speech after voice conversion and emotional speech style transformation.

Highlights

Speaker identification and emotional speech recognition systems, as well as speech recognition systems, use different types of speech features, which can be systematically divided into segmental and supra-segmental ones [1].
The aim of this study is to develop a simple emotional speech style classifier based on the Gaussian mixture model (GMM) approach, usable for objective evaluation of synthetic speech quality as an alternative to manually performed listening tests.
As the score is a statistical variable containing probability/uncertainty, the results show variability, which can cause erroneous emotion determination when the final score contains comparable values for several emotions.

Summary

Introduction

Speaker identification and emotional speech recognition systems, as well as speech recognition systems, use different types of speech features, which can be systematically divided into segmental and supra-segmental ones [1]. We focus mainly on voice conversion and emotional speech style transformation in text-to-speech systems speaking Czech and Slovak [22], for voice communication systems with a human–machine (computer) interface [23] or for communication aids for handicapped people [24,25]. These two languages (both Slavonic) are similar but not identical: a common speech corpus can be used to obtain spectral parameters, but at the phonetic and prosodic level the synthetic speech must be processed separately. The article describes the experiments performed and compares GMM classification of male and female acted speech in four emotional states (joy, sadness, anger, and a neutral state) spoken in Czech and Slovak. This speech corpus was primarily used for determination of spectral and prosodic parameters for emotional speech conversion [26]. The order of parameters in the input feature vector has minimal influence on the classification error rate of the whole emotional speech classifier.
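As a minimal sketch of GMM classification with the two-level architecture mentioned in the abstract (a gender recognizer followed by a gender-specific emotion classifier), the decision logic might look as follows. Everything concrete here is an assumption made for illustration: the two-dimensional feature vector, the function names, and all model weights, means, and variances are invented toy values, not the authors' trained models, and the training step is omitted entirely.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def score(frames, gmm):
    """Average per-frame log-likelihood of a sequence of feature vectors."""
    return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)

def classify(frames, models):
    """Return the label of the best-scoring model together with all scores."""
    scores = {label: score(frames, gmm) for label, gmm in models.items()}
    return max(scores, key=scores.get), scores

def classify_two_level(frames, gender_models, emotion_models):
    """Level 1 picks the gender; level 2 uses that gender's emotion models."""
    gender, _ = classify(frames, gender_models)
    emotion, scores = classify(frames, emotion_models[gender])
    return gender, emotion, scores

# Toy models, invented for illustration only (not taken from the article).
GENDER_MODELS = {
    "male": [(0.5, [110.0, 1.0], [400.0, 1.0]),
             (0.5, [130.0, 1.0], [400.0, 1.0])],
    "female": [(0.5, [210.0, 1.0], [400.0, 1.0]),
               (0.5, [230.0, 1.0], [400.0, 1.0])],
}

def _toy_emotions(base):
    """Build four single-component toy emotion GMMs around a base value."""
    var = [225.0, 0.04]
    return {
        "neutral": [(1.0, [base, 1.0], var)],
        "joy": [(1.0, [base + 20.0, 1.3], var)],
        "sadness": [(1.0, [base - 15.0, 0.7], var)],
        "anger": [(1.0, [base + 30.0, 1.6], var)],
    }

EMOTION_MODELS = {"male": _toy_emotions(110.0), "female": _toy_emotions(210.0)}
```

Because the second level only consults the emotion models selected by the first-level decision, a wrong gender label sends the utterance to the wrong model set, which is in line with the abstract's remark on the importance of correct gender classification. The per-emotion scores returned by `classify_two_level` can also be inspected to see whether several emotions score comparably.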

Subject and method

Calculation of the frequency parameters from the zero crossing periods
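This summary does not spell out the procedure, so the following is only one plausible sketch of deriving frequency values from zero-crossing periods, not necessarily the authors' method. It assumes positive-going zero crossings are located by linear interpolation between samples and that each period between consecutive crossings yields one frequency value; the function name `zero_crossing_frequencies` and its parameters are hypothetical.

```python
import math

def zero_crossing_frequencies(samples, sample_rate):
    """Estimate one frequency (Hz) per period between positive-going zero crossings."""
    crossings = []  # crossing times in seconds
    for n in range(1, len(samples)):
        prev, cur = samples[n - 1], samples[n]
        if prev < 0.0 <= cur:
            # Linear interpolation of the crossing position between n-1 and n.
            frac = -prev / (cur - prev)
            crossings.append((n - 1 + frac) / sample_rate)
    # Each pair of consecutive crossings spans one period.
    return [1.0 / (t1 - t0) for t0, t1 in zip(crossings, crossings[1:])]

# Usage: a 100 Hz sine sampled at 8 kHz (synthetic test signal).
signal = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0 + 0.3) for n in range(800)]
freqs = zero_crossing_frequencies(signal, 8000.0)
```

On real speech such an estimate is sensitive to noise and to the harmonic structure of the signal, so some band-limiting before the crossing detection would typically be needed; that step is left out of this sketch.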


Findings

Conclusion