This study investigated vocal emotions in Japanese by analyzing acoustic features of emotional utterances from the Online Gaming Voice Chat Corpus with Emotional Label (Arimoto and Kawatsu, 2013). The corpus contains sentences produced in eight emotions by four native Japanese speakers, all professional actors. Acoustic features were extracted with the Praat script ProsodyPro. Principal component analysis (PCA) was conducted to evaluate the contribution of each acoustic feature. In addition, a linear discriminant analysis (LDA) classifier was trained on the extracted acoustic features to predict emotion category and intensity. A generalized additive mixed model (GAMM) was fitted to examine the effects of gender, emotion category, and emotion intensity on time-normalized f0 values. The GAMM results indicated effects of gender, emotion, and emotional intensity on the time-normalized f0 values of vocal emotions in Japanese. The recognition accuracy of the LDA classifier reached about 60%, suggesting that although pitch-related measures are important for differentiating vocal emotions, bio-informational features (e.g., jitter, shimmer, and harmonicity) are also informative. In addition, the correlation analysis suggested that vocal emotions are conveyed by a set of features rather than by individual features alone.
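For illustration, the following is a minimal sketch of a comparable PCA-plus-LDA pipeline in Python with scikit-learn; the feature matrix, labels, and parameter choices are placeholders and are not taken from the study itself.

```python
# Hypothetical sketch: PCA followed by an LDA classifier on acoustic features,
# assuming a feature matrix X (utterances x acoustic measures such as mean f0,
# jitter, shimmer, harmonicity) and emotion labels y. Not the authors' code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))      # placeholder acoustic feature matrix
y = rng.integers(0, 8, size=200)    # placeholder labels for eight emotions

pipeline = make_pipeline(
    StandardScaler(),                # z-score features before PCA
    PCA(n_components=0.95),          # keep components explaining 95% of variance
    LinearDiscriminantAnalysis(),    # linear discriminant classifier
)

# Cross-validated recognition accuracy of the emotion classifier
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```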