Abstract

In this paper, we describe the implementation and evaluation of neural-network-based Text to Speech synthesizers for Spanish and Basque. Several voices were built, all of them using a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signal from the spectrograms. The limited amount of training data leads to synthesis errors in some sentences. To detect those errors automatically, we developed a new method that finds the sentences that have lost the alignment during the inference process. To mitigate the problem, we implemented a guided attention that provides the system with the explicit duration of the phonemes. The resulting system was evaluated to assess its robustness, quality and naturalness with both objective and subjective measures. The results reveal the capacity of the system to produce good-quality and natural-sounding audio.
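
As an illustration of the pipeline described above, the following sketch runs inference with NVIDIA's publicly released English Tacotron 2 and WaveGlow checkpoints from torch.hub. These are not the Spanish/Basque voices built in this work, and the alignment_looks_lost helper with its focus_threshold parameter is only an assumed heuristic for flagging lost alignments, not the detection method proposed in the paper.

    import torch
    from scipy.io.wavfile import write

    hub = 'NVIDIA/DeepLearningExamples:torchhub'
    tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2').to('cuda').eval()
    waveglow = torch.hub.load(hub, 'nvidia_waveglow')
    waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
    utils = torch.hub.load(hub, 'nvidia_tts_utils')

    # Text -> padded integer sequences and their lengths.
    sequences, lengths = utils.prepare_input_sequence(
        ["The quick brown fox jumps over the lazy dog."])

    def alignment_looks_lost(align, focus_threshold=0.3):
        # align: (decoder_steps, encoder_steps) attention matrix of one utterance.
        # Illustrative heuristic: if, on average, no encoder state receives a
        # dominant share of the attention, the decoder has likely lost alignment.
        focus = align.max(dim=1).values      # peak attention weight per frame
        return focus.mean().item() < focus_threshold

    with torch.no_grad():
        # Tacotron 2: input sequence -> mel-spectrogram (plus attention alignments).
        mel, mel_lengths, alignments = tacotron2.infer(sequences, lengths)
        # WaveGlow: mel-spectrogram -> waveform samples.
        audio = waveglow.infer(mel)

    if alignment_looks_lost(alignments[0]):
        print("Possible synthesis error: attention alignment lost.")
    write("output.wav", 22050, audio[0].cpu().numpy())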

Highlights

  • The aim of Text to Speech (TTS) systems is to create synthetic speech from input written language

  • The systems used to synthesize the utterances were the baseline system described in Section 3.1 and the Tacotron with pre-alignment guided attention described in Section 3.3 (Taco-PAG); a sketch of a duration-guided attention term of this kind appears after this list

  • Neural networks allow the generation of synthetic voices that show high resemblance to natural ones, but require large amounts of data to train them
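
Relating to the duration-guided attention in Taco-PAG, the sketch below shows one common way to formulate such guidance: a target attention mask is built from known per-phoneme durations and any attention mass falling outside it is penalized. The function names, the hard 0/1 mask and the weighting factor lambda_ga are assumptions for illustration; the pre-alignment guided attention described in the paper may be formulated differently.

    import torch

    def duration_attention_target(durations, n_frames):
        # durations: per-phoneme durations in mel frames (assumed to sum to n_frames).
        # Returns a (n_frames, n_phonemes) 0/1 matrix marking, for each decoder
        # frame, the phoneme it should attend to according to the known durations.
        target = torch.zeros(n_frames, len(durations))
        frame = 0
        for idx, dur in enumerate(durations):
            target[frame:frame + dur, idx] = 1.0
            frame += dur
        return target

    def guided_attention_loss(attention, durations):
        # attention: (n_frames, n_phonemes) soft attention weights from the decoder.
        # Penalize attention mass placed outside the duration-dictated region.
        target = duration_attention_target(durations, attention.size(0)).to(attention.device)
        return (attention * (1.0 - target)).sum() / attention.size(0)

    # During training this term would be added to the spectrogram loss, e.g.
    # loss = mel_loss + lambda_ga * guided_attention_loss(attention, durations)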

Summary

Introduction

The aim of Text to Speech (TTS) systems is to create synthetic speech from input written language. Traditional approaches, such as those based on Hidden Markov Models (HMM) [3,4], rely on complex multi-stage pipelines and require large domain expertise, impeding wider access to this technology. Deep Neural Network (DNN)-based systems are the current state of the art in speech synthesis [5,6]. DNNs give TTS systems the capacity to capture more efficiently the complex nonlinear relations between the acoustic parameters of the voice and the symbolic representation of speech. This improvement results in the generation of higher-quality and more natural speech than that obtained with traditional methods. Some examples of DNN-based architectures are Feed-Forward Networks (FF) [7] and Recurrent Neural Networks (RNN).
