Abstract

We have been investigating rakugo speech synthesis as a challenging example of speech synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal entertainment, similar to a combination of one-person stand-up comedy and comic storytelling, that remains popular today. In rakugo, a single performer plays multiple characters, and conversations or dialogues between the characters move the story forward. To investigate how closely the quality of synthesized rakugo speech can approach that of professional performers' speech, we modeled rakugo speech using Tacotron 2, a state-of-the-art speech synthesis system that can produce speech as natural-sounding as human speech albeit under limited conditions, and an enhanced version of it with self-attention to better capture long-term dependencies. We also used global style tokens and manually labeled context features to enrich speaking styles. Through a listening test, we measured not only naturalness but also the distinguishability of characters, the understandability of the content, and the degree of entertainment. Although we found that the speech synthesis models could not yet reach the professional level, the results of the listening test provided interesting insights: 1) to further entertain audiences, we should focus not only on the naturalness of synthesized speech but also on the distinguishability of characters and the understandability of the content; 2) the fundamental frequency (fo) expressions of synthesized speech are poorer than those of human speech, and more entertaining speech should have richer fo expression. Although there is room for improvement, we believe this is an important stepping stone toward achieving entertaining speech synthesis at the professional level.

Highlights

  • Can machines read texts aloud like humans? The answer is yes, albeit under limited conditions

  • We modeled rakugo speech with two speech synthesis systems, Tacotron 2 [1] and an enhanced version of it with self-attention [14]

  • 1) Systems (V-B) were revised to be consistent with the original Tacotron 2 paper [1], except for the kind of attention module (IV-B and IV-C); 2) SA-Tacotron was used; 3) speech samples used in the listening test were synthesized from models based on a systematic selection: Tacotron 2 or SA-Tacotron, with or without the combination of global style tokens (GSTs) and/or manually labeled context features; and 4) a more detailed listening test was conducted, directly asking listeners how well they were entertained

Summary

INTRODUCTION

Can machines read texts aloud like humans? The answer is yes, albeit under limited conditions. Compared with earlier work: 1) systems (V-B) were revised to be consistent with the original Tacotron 2 paper [1], except for the kind of attention module (IV-B and IV-C); 2) SA-Tacotron was used; 3) speech samples used in the listening test were synthesized from models based on a systematic selection: Tacotron 2 or SA-Tacotron, with or without the combination of GSTs and/or manually labeled context features; and 4) a more detailed listening test was conducted, directly asking listeners how well they were entertained. The second reason is that rakugo speech uses slightly old-fashioned Japanese dialects, as mentioned in III-A, so automatic morphological analysis and pitch-accent annotation defined for modern standard Japanese do not work properly, although they are normally used as inputs in Japanese pipeline models [40].
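The self-attention added to Tacotron 2 here lets every frame of a sequence attend directly to every other frame, which is what helps the model capture the long-term dependencies found in long rakugo utterances. Below is a minimal sketch of scaled dot-product self-attention, the standard mechanism this refers to; it is illustrative only, not the paper's implementation, and all variable names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of hidden states (e.g., encoder outputs).
    # Wq, Wk, Wv: (d, d_k) learned projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each of the T frames scores its affinity with every other frame,
    # so distant frames can influence each other in a single step.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V  # (T, d_k) context-mixed representation
```

In contrast, the recurrent layers of plain Tacotron 2 propagate information only step by step, so attention of this form offers a shorter path between distant frames.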

RESULTS
CONCLUSION