Hybrid statistical/unit-selection Turkish speech synthesis using suffix units

Cenk Demiroğlu, Ekrem Güner

doi:10.1186/s13636-016-0082-0

Open Access

https://doi.org/10.1186/s13636-016-0082-0

Abstract

Unit-selection-based text-to-speech (TTS) synthesis has been the dominant TTS approach of the last decade. Despite its success, the unit selection approach has disadvantages. One of the most significant is the sudden discontinuities in speech that distract listeners (Speech Commun 51:1039–1064, 2009). A second is that building a high-quality synthesis system requires significant expertise and large amounts of data, which is costly and time-consuming. The statistical speech synthesis (SSS) approach is a promising alternative. Not only are the spurious errors observed in unit selection systems mostly absent in SSS, but building voice models is also far less expensive and faster than building a unit selection system. However, the resulting speech is typically not as natural-sounding as speech synthesized with a high-quality unit selection system. Hybrid methods attempt to take advantage of both SSS and unit selection, but existing hybrid methods still require the development of a high-quality unit selection system. Here, we propose a novel hybrid statistical/unit selection system for Turkish that aims to improve the quality of the baseline SSS system by improving prosodic parameters such as intonation and stress. Commonly occurring Turkish suffixes are stored in the unit selection database and used in the proposed system. Unlike existing hybrid systems, the proposed system was developed without building a complete unit selection synthesis system. The proposed method can therefore be used without the large amounts of data, substantial expertise, or time-consuming tuning typically required to build unit selection systems. Listeners preferred the hybrid system over the baseline system in AB preference tests.

Highlights

The hidden Markov model (HMM)-based statistical speech synthesis (SSS) approach has been shown to generate good-quality, intelligible speech [1].
Some hybrid systems aim to smooth out the transitions between units in the concatenative approach using the smooth trajectories of the SSS approach [8].
A hybrid statistical/unit selection speech synthesis system is proposed that significantly improved the quality of a Turkish SSS system.

Summary

Introduction

The HMM-based statistical speech synthesis (SSS) approach has been shown to generate good-quality, intelligible speech [1]. Well-tuned unit selection systems, which are built with substantially more training data than SSS systems, typically produce more natural speech. Hybrid SSS/unit selection methods typically attempt to improve the quality of unit selection systems by generating speech that is as smooth as in the SSS approach and as natural-sounding as in the unit selection approach. In one approach, the SSS system is used to compute the target cost in unit selection. In a second approach, SSS-generated waveforms are interwoven with the speech units selected from the database [6, 7]. The idea is to use smooth SSS-generated waveforms when a unit with a low cost cannot be found in the database. Other hybrid systems aim to smooth out the transitions between units in the concatenative approach using the smooth trajectories of the SSS approach [8].
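The fallback idea behind the second approach can be illustrated with a minimal sketch. This is not the system proposed in this paper, nor the exact method of [6, 7]. The `Candidate` class, the cost values, the additive combination of target and join cost, and the `max_cost` threshold are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A database unit considered for one position in the utterance (hypothetical)."""
    unit_id: str
    target_cost: float  # assumed distance from the SSS-predicted parameters
    join_cost: float    # assumed mismatch with the preceding segment


def pick_segment(candidates: List[Candidate], max_cost: float) -> Optional[Candidate]:
    """Return the cheapest unit, or None when no unit is cheap enough."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: c.target_cost + c.join_cost)
    if best.target_cost + best.join_cost > max_cost:
        return None  # signal that the SSS-generated waveform should be used
    return best


def plan_utterance(slots: List[List[Candidate]], max_cost: float = 1.0) -> List[str]:
    """Label each position with a unit id, or 'SSS' for the statistical fallback."""
    plan = []
    for candidates in slots:
        best = pick_segment(candidates, max_cost)
        plan.append(best.unit_id if best is not None else "SSS")
    return plan


# Example: the first slot has a cheap unit, the second has only an
# expensive one, and the third has no candidates at all.
slots = [
    [Candidate("u1", 0.2, 0.1), Candidate("u2", 0.5, 0.4)],
    [Candidate("u3", 0.9, 0.6)],
    [],
]
print(plan_utterance(slots, max_cost=1.0))
```

In this sketch the threshold alone decides when the synthesized waveform replaces a database unit. A real system would also have to handle the join between natural and SSS-generated segments, which the sketch does not model.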

Methods

Results

Conclusion