Abstract

Emotion recognition from speech signals is one of the key technologies for natural conversation between humans and robots. Most emotion recognizers extract prosodic features from the input speech and use them for emotion recognition. However, prosodic features change drastically depending on the uttered text. To address this problem, we have proposed a method that normalizes prosodic features using synthesized speech that has the same word sequence as the input but is uttered with a “neutral” emotion. In that method, all prosodic features (pitch, power, etc.) are normalized. However, it is not known which prosodic features should be normalized. In this paper, all combinations of features with and without normalization were examined, and the most appropriate normalization scheme was identified. When both “RMS Energy” (root-mean-square frame energy) and “VoiceProb” (power of harmonics divided by the total power) were normalized, emotion recognition accuracy was 5.98% higher than the accuracy without normalization.
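The normalization idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame length, hop size, and the choice of per-frame subtraction against the synthesized neutral reference are all assumptions, and the `rms_energy` helper is a hypothetical stand-in for a proper feature extractor (e.g. an openSMILE-style front end).

```python
import numpy as np

def rms_energy(signal, frame_len=400, hop=160):
    """Per-frame root-mean-square energy of a 1-D signal.

    Frame length and hop (in samples) are illustrative choices,
    e.g. 25 ms / 10 ms at 16 kHz.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.sqrt(np.mean(frames ** 2, axis=1))

def normalize_feature(input_feat, neutral_feat):
    """Normalize a prosodic feature track against the same feature
    extracted from synthesized neutral speech of the same word sequence.

    Assumption: normalization is frame-wise subtraction, with the two
    tracks aligned by simple truncation to the shorter length.
    """
    m = min(len(input_feat), len(neutral_feat))
    return input_feat[:m] - neutral_feat[:m]

# Usage sketch: emotional input vs. synthesized neutral reference.
emotional = rms_energy(np.random.randn(16000))   # placeholder waveform
neutral = rms_energy(np.random.randn(16000))     # placeholder waveform
normalized = normalize_feature(emotional, neutral)
```

The same `normalize_feature` step would be applied to each selected feature track (e.g. RMS Energy and VoiceProb) before feeding the result to the emotion classifier.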
