This paper presents and discusses an analysis of multilingual emotion recognition from speech using database-specific emotional features. Recognition was performed on the English, Slovenian, Spanish, and French InterFace emotional speech databases. The InterFace databases include several neutral speaking styles and six emotions: disgust, surprise, joy, fear, anger, and sadness. Speech features for emotion recognition were determined in two steps. In the first step, low-level features were defined; in the second, high-level features were calculated from the low-level features. The low-level features comprise pitch, the derivative of pitch, energy, the derivative of energy, and the duration of speech segments. The high-level features are statistical representations of the low-level features. The database-specific emotional features were selected as those high-level features that carry the most information about emotions in speech. Speaker-dependent, monolingual, and multilingual emotion recognisers were defined, and emotion recognition was performed using artificial neural networks. Recognition accuracy was highest for speaker-dependent emotion recognition, lower for monolingual emotion recognition, and lowest for multilingual emotion recognition. The database-specific emotional features are most convenient for use in multilingual emotion recognition: among speaker-dependent, monolingual, and multilingual emotion recognition, the difference between recognition with all high-level features and recognition with database-specific emotional features is smallest for multilingual emotion recognition, at 3.84%.
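The two-step feature definition described above can be illustrated with a minimal sketch. It assumes that pitch and energy are available as per-frame contours, that the derivative is approximated by a first-order difference, and that the high-level statistics are mean, standard deviation, minimum, and maximum. The abstract does not name the statistics used, so that set, along with the function names `low_level_features` and `high_level_features`, is hypothetical and chosen only for illustration.

```python
import numpy as np


def low_level_features(pitch, energy, durations):
    """Step 1: collect the five low-level contours named in the abstract.

    Assumption: the derivative is approximated by a first-order
    difference between consecutive frames.
    """
    pitch = np.asarray(pitch, dtype=float)
    energy = np.asarray(energy, dtype=float)
    return {
        "pitch": pitch,
        "d_pitch": np.diff(pitch),
        "energy": energy,
        "d_energy": np.diff(energy),
        "duration": np.asarray(durations, dtype=float),
    }


# Assumption: an illustrative set of statistics, not the paper's list.
STATS = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}


def high_level_features(low_level):
    """Step 2: summarise each low-level contour with simple statistics."""
    return {
        f"{name}_{stat}": float(fn(values))
        for name, values in low_level.items()
        for stat, fn in STATS.items()
    }


# Toy input for one utterance; these values are not data from the paper.
features = high_level_features(
    low_level_features(
        pitch=[120.0, 130.0, 125.0, 140.0],   # Hz, per frame
        energy=[0.5, 0.7, 0.6, 0.9],          # arbitrary units, per frame
        durations=[0.12, 0.30, 0.25],         # seconds, per speech segment
    )
)
print(len(features))  # 5 contours x 4 statistics
```

In such a setup, selecting database-specific emotional features would amount to keeping a subset of the keys in `features` for each database before training the recogniser; the selection criterion itself is not given in the abstract and is not sketched here.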