Abstract

The Japanese Twitter-based emotional speech (JTES) corpus was recently constructed as an emotional speech corpus. It is built from tweets, with an emotional label assigned to each sentence, and its sentences are selected to balance both phonemic and prosodic coverage. Emotional speech recognition is a more difficult task than recognition of non-emotional speech. In this study, we aim to improve emotional speech recognition performance on the JTES corpus through acoustic model adaptation. A deep neural network-hidden Markov model (DNN-HMM) hybrid is used as the acoustic model. As a baseline, a word error rate (WER) of 38.0% was obtained when the DNN-HMM was trained on the Corpus of Spontaneous Japanese; this model served as the initial model for adaptation. Several types of adaptation were examined, yielding substantial performance improvements, and a final WER of 23.05% was obtained using speaker adaptation.
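The abstract does not specify the adaptation procedure in detail; a common realization of acoustic model adaptation for a DNN-HMM system is to fine-tune the DNN's senone classifier on a small amount of in-domain (emotional or speaker-specific) data, starting from the out-of-domain model. The following is a minimal PyTorch sketch of that idea; the network shape, feature dimension, senone count, and data are all hypothetical placeholders, not the paper's configuration.

```python
# Illustrative sketch: adapting a pre-trained DNN acoustic model by
# fine-tuning it on a small amount of target-domain data.
# All dimensions and data below are hypothetical.
import torch
import torch.nn as nn

# Toy DNN acoustic model: frame-level features -> senone posteriors.
# In practice this would be the model pre-trained on the large
# out-of-domain corpus (e.g., spontaneous speech).
model = nn.Sequential(
    nn.Linear(40, 512), nn.ReLU(),   # 40-dim acoustic features (assumed)
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 2000),            # 2000 senone classes (assumed)
)

# Stand-in adaptation data: acoustic frames with senone labels,
# which would normally come from forced alignment of the
# adaptation utterances.
frames = torch.randn(256, 40)
labels = torch.randint(0, 2000, (256,))

# Fine-tune with a small learning rate so the adapted model stays
# close to the well-trained initial model.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Speaker adaptation, which gave the best result in the study, would follow the same pattern with each target speaker's data used as the fine-tuning set.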
