Abstract

In this article, we propose a new set of acoustic features for automatic emotion recognition from audio. The features are based on the perceptual quality metrics defined in the ITU-R BS.1387 recommendation, Perceptual Evaluation of Audio Quality (PEAQ). Starting from the outer- and middle-ear models of the auditory system, we base our features on masked perceptual loudness, which defines relatively objective criteria for emotion detection. The features, computed in critical bands following the reference-based concept, include the partial loudness of the emotional difference, the emotional difference-to-perceptual mask ratio, measures of alterations of temporal envelopes, measures of harmonics of the emotional difference, the occurrence probability of emotional blocks, and the perceptual bandwidth. A soft-majority voting decision rule that strengthens conventional majority voting is proposed to assess the classifier outputs. Compared to state-of-the-art systems, including the Munich Open-Source Emotion and Affect Recognition Toolkit, the Hidden Markov Model Toolkit, and Generalized Discriminant Analysis, the emotion recognition rates improve by 7-16% on EMO-DB and by 7-11% on VAM for the "all" and "valence" tasks.
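One of the reference-based features above, the emotional difference-to-perceptual mask ratio, can be illustrated per critical band in the spirit of the noise-to-mask ratio used in PEAQ. The following is a minimal sketch, not the paper's exact definition: the array shapes, the mask `M`, and the epsilon guard are assumptions.

```python
import numpy as np

def difference_to_mask_ratio(E_ref, E_emo, M, eps=1e-12):
    """Per-band, per-frame ratio (dB) between the emotional difference
    energy and the perceptual masking threshold (illustrative sketch).

    E_ref, E_emo : arrays of shape (K, N) -- excitation of the reference
                   and the emotional audio in K critical bands over N
                   frames (linear energy; placeholder layout).
    M            : array of shape (K, N) -- masking threshold per band.
    """
    e = np.abs(E_emo - E_ref)                 # emotional difference energy
    return 10.0 * np.log10((e + eps) / (M + eps))
```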

Highlights

  • It is well known that human speech contains both linguistic content and the emotion of the speaker

  • We propose the hypothesis that emotional differences, determined through perceptual masking, are more discriminative than the emotional data itself

  • The perceptual support vector machine (P-SVM) with soft-majority voting (S-MV) provides a 7-16% improvement on the Berlin emotional speech database (EMO-DB) and 7-11% on the Vera am Mittag database (VAM) for valence, while the improvement of the perceptual Gaussian mixture model (P-GMM) is 4-12% on EMO-DB and 6-10% on VAM (see the voting sketch after this list)
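The soft-majority voting idea referenced above can be sketched as follows, assuming the classifier emits per-frame class posteriors; the exact weighting used in the paper may differ. Conventional majority voting counts hard per-frame decisions, while S-MV accumulates the posterior scores themselves, so confident frames weigh more than marginal ones.

```python
import numpy as np

def majority_vote(posteriors):
    """Conventional MV: each frame casts one hard vote for its argmax class.
    posteriors: array of shape (n_frames, n_classes)."""
    votes = np.bincount(posteriors.argmax(axis=1),
                        minlength=posteriors.shape[1])
    return votes.argmax()

def soft_majority_vote(posteriors):
    """S-MV sketch: sum the posterior mass per class across frames and
    pick the class with the largest accumulated score."""
    return posteriors.sum(axis=0).argmax()
```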


Summary

Introduction

It is well known that human speech contains both linguistic content and the emotion of the speaker. Unlike conventional audio feature extraction modules, which mostly operate on the Mel scale and therefore model speech content efficiently rather than emotion, we propose working on perceptual spectrums derived on the Bark scale; the frequency borders of the band-pass filter range from 80 to 18000 Hz. Let e[k, n] denote the difference, in dB, between the excitation levels of the reference and the emotional audio computed in Bark band k for audio frame n. Since both the probability of detection and the number of steps remaining above the loudness threshold depend on the excitation patterns, we can expect the excitation pattern of audio in the mode happy to have higher peaks than in the mode bored, these modes being located on the positive and negative scales of arousal, respectively.
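A minimal sketch of how e[k, n] could be obtained is given below, assuming magnitude spectra are grouped into Bark bands between the stated 80 Hz and 18 kHz borders and converted to dB. The Traunmüller Bark approximation, the band count, and the omission of the BS.1387 outer/middle-ear weighting and spreading are all simplifying assumptions.

```python
import numpy as np

def bark(f_hz):
    """Traunmüller's approximation of the Bark scale (an assumption;
    BS.1387 uses its own critical-band layout)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def excitation_db(frames_fft, freqs, n_bands=24, f_lo=80.0, f_hi=18000.0):
    """Sum spectral power into Bark bands and convert to dB.
    frames_fft: (N, F) magnitude spectra; freqs: (F,) bin frequencies."""
    edges = np.linspace(bark(f_lo), bark(f_hi), n_bands + 1)
    band_idx = np.digitize(bark(freqs), edges) - 1
    E = np.zeros((n_bands, frames_fft.shape[0]))
    for k in range(n_bands):
        sel = band_idx == k
        if sel.any():
            E[k] = (frames_fft[:, sel] ** 2).sum(axis=1)
    return 10.0 * np.log10(E + 1e-12)

# e[k, n]: excitation-level difference (dB), reference vs. emotional audio
# e = excitation_db(emo_fft, freqs) - excitation_db(ref_fft, freqs)
```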

[Figure: perceptual feature labels L, WE1, WE2, AEB, NSE1, NSE2, NSE3, AHSM, NPD]