Abstract

This chapter discusses the issues involved in creating and using speech output in multiple languages, that is, multilingual speech synthesis, and describes some of the current technologies for building synthetic voices in new languages. It presents the basic steps involved in building a synthesizer for a new language: defining a phone set, defining a lexicon, designing a database to record, recording the database, building the synthesizer, normalizing text, creating prosodic models, evaluating and tuning, and addressing language-specific issues. Widely available tools, such as those provided in the FestVox suite, have helped to increase the number of experts trained in speech synthesis and have thus paved the way for successful research and commercial systems. For waveform synthesis, concatenative synthesis is the easiest technique and produces high-quality output. There are two fundamental techniques in concatenative synthesis: diphone synthesis and unit selection. Diphone synthesis follows from the observation that phone boundaries are the most dynamic portions of the acoustic signal and thus the least appropriate places for joining units; diphone units therefore run from the middle of one phone to the middle of the next, so that joins fall in the more stable regions. Unit selection speech synthesis is based on the concatenation of appropriate sub-word units selected from a database of natural speech. A description of large evaluation efforts across languages complements the chapter.
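
As a small illustration of the diphone idea, the sketch below turns a phone sequence into the list of diphone units a synthesizer would need to concatenate. Each unit spans the transition between two adjacent phones, so the joins fall mid-phone and not at the phone boundary. The function name `phones_to_diphones`, the silence symbol `pau`, the hyphenated unit names, and the example phone labels are assumptions made for this sketch and are not taken from the chapter or from any particular toolkit.

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the diphone names covering a phone sequence.

    The sequence is padded with a silence symbol on both sides so that
    the first and last phones also get a transition unit.
    """
    padded = [silence] + list(phones) + [silence]
    # Pair each phone with its successor: one diphone per phone boundary.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Hypothetical phone labels for a short word.
    print(phones_to_diphones(["h", "ax", "l", "ow"]))
```

A sequence of n phones yields n + 1 diphones under this padding scheme, which is why a diphone inventory for a language grows roughly with the square of its phone set size.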
