Abstract

We have been investigating rakugo speech synthesis as a challenging example of speech synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal entertainment, similar to a combination of one-person stand-up comedy and comic storytelling, that remains popular today. In rakugo, a single performer plays multiple characters, and conversations or dialogues between the characters move the story forward. To investigate how closely the quality of synthesized rakugo speech can approach that of professional performers' speech, we modeled rakugo speech using Tacotron 2, a state-of-the-art speech synthesis system that can produce speech as natural-sounding as human speech albeit under limited conditions, and an enhanced version of it with self-attention to better capture long-term dependencies. We also used global style tokens and manually labeled context features to enrich speaking styles. Through a listening test, we measured not only naturalness but also the distinguishability of characters, the understandability of the content, and the degree of entertainment. Although we found that the speech synthesis models could not yet reach the professional level, the results of the listening test provided interesting insights: 1) to further entertain audiences, we should focus not only on the naturalness of synthesized speech but also on the distinguishability of characters and the understandability of the content; 2) the fundamental frequency (fo) expressions of synthesized speech are poorer than those of human speech, and more entertaining speech should have richer fo expression. Although there is room for improvement, we believe this is an important stepping stone toward achieving entertaining speech synthesis at the professional level.

Highlights

  • Can machines read texts aloud like humans? The answer is yes, albeit under limited conditions

  • We modeled rakugo speech with two speech synthesis systems, Tacotron 2 [1] and an enhanced version of it with self-attention [14]

  • 1) Systems (V-B) were revised to be consistent with the original Tacotron 2 paper [1], except for the kind of attention module (IV-B and IV-C); 2) SA-Tacotron was used; 3) speech samples used in the listening test were synthesized from models based on a systematic selection: Tacotron 2 or SA-Tacotron, with or without the combination of global style tokens (GSTs) and/or manually labeled context features; and 4) a more detailed listening test was conducted, directly asking listeners how well they were entertained

Summary

INTRODUCTION

Can machines read texts aloud like humans? The answer is yes, albeit under limited conditions. Compared with earlier work: 1) systems (V-B) were revised to be consistent with the original Tacotron 2 paper [1], except for the kind of attention module (IV-B and IV-C); 2) SA-Tacotron was used; 3) speech samples used in the listening test were synthesized from models based on a systematic selection: Tacotron 2 or SA-Tacotron, with or without the combination of GSTs and/or manually labeled context features; and 4) a more detailed listening test was conducted, directly asking listeners how well they were entertained. The second reason is that rakugo speech uses slightly old-fashioned Japanese dialects, as mentioned in III-A, so automatic morphological analysis and pitch-accent annotation defined for modern standard Japanese do not work properly, although they are normally used as inputs in Japanese pipeline models [40].
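The self-attention added to Tacotron 2 here lets every frame of a sequence attend directly to every other frame, which is what helps the model capture the long-term dependencies found in long rakugo utterances. Below is a minimal sketch of scaled dot-product self-attention, the standard mechanism this refers to; it is illustrative only, not the paper's implementation, and all variable names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of hidden states (e.g., encoder outputs).
    # Wq, Wk, Wv: (d, d_k) learned projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each of the T frames scores its affinity with every other frame,
    # so distant frames can influence each other in a single step.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V  # (T, d_k) context-mixed representation
```

In contrast, the recurrent layers of plain Tacotron 2 propagate information only step by step, so attention of this form offers a shorter path between distant frames.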

RESULTS
CONCLUSION