Concatenative text-to-speech synthesis based on waveform interpolation (a time frequency approach)

Edmilson Morais, Grzegorz Dogil

doi:10.1121/1.4777736

Abstract

Time-domain pitch-synchronous overlap-and-add (TD-PSOLA) is the technique most widely used in commercial concatenative text-to-speech (TTS) synthesis systems. However, TD-PSOLA is well known to have several drawbacks. To overcome some of them, this work presents a method based on time-frequency interpolation (TFI) [Yair Shoham]. The method introduced here is a pitch-synchronous time-frequency variant of the waveform interpolation (WI) technique [Bastian Kleijn]. The goal of this work is to show that the TFI technique offers some important advantages for concatenative TTS synthesis. It allows pitch-scale modification (PSM) independent of time-scale modification (TSM) in a straightforward manner and with high quality. TSM and PSM can be performed continuously, without any limitation on pitch-period resolution. Moreover, the TFI technique allows simple, flexible, and efficient procedures for smoothing the boundaries between diphones (or any other kind of unit). The proposed system was evaluated using diphones and prosodies generated by the Festival system [Alan Black, Paul Taylor]. Subjective tests comparing the proposed TFI system with the standard TD-PSOLA system indicated that the proposed system achieves superior quality.
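The core idea behind waveform interpolation, blending pitch cycles on a normalized phase axis so that the output period can take any value, can be illustrated with a minimal sketch. This is not the authors' TFI implementation: it is a simplified time-domain stand-in that uses linear interpolation, and the function names `resample_cycle` and `interpolate_cycles` are hypothetical.

```python
import numpy as np

def resample_cycle(cycle, n_points):
    """Resample one pitch cycle onto n_points samples of a normalized phase axis [0, 1)."""
    cycle = np.asarray(cycle, dtype=float)
    src_phase = np.arange(len(cycle)) / len(cycle)
    dst_phase = np.arange(n_points) / n_points
    # period=1.0 treats the cycle as periodic, so the wrap-around region
    # between the last and first sample is interpolated as well.
    return np.interp(dst_phase, src_phase, cycle, period=1.0)

def interpolate_cycles(cycle_a, cycle_b, alpha, target_len):
    """Blend two pitch cycles of possibly different lengths.

    alpha = 0 returns cycle_a, alpha = 1 returns cycle_b (both resampled to
    target_len). Assumption: linear blending of aligned cycles; a real system
    would also align the cycles in phase before blending.
    """
    a = resample_cycle(cycle_a, target_len)
    b = resample_cycle(cycle_b, target_len)
    return (1.0 - alpha) * a + alpha * b

# Two synthetic cycles with periods of 80 and 100 samples, blended halfway
# and read out with an intermediate period of 90 samples.
cycle_a = np.sin(2 * np.pi * np.arange(80) / 80)
cycle_b = np.sin(2 * np.pi * np.arange(100) / 100)
mid_cycle = interpolate_cycles(cycle_a, cycle_b, alpha=0.5, target_len=90)
```

Because `target_len` is chosen independently of the number of cycles generated, the pitch period and the duration can be changed separately in this sketch, which mirrors the independence of PSM and TSM described above. Blending the cycles on either side of a unit boundary in the same way gives a simple form of boundary smoothing.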
