The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset

Duc Chung Tran

doi:10.1016/j.dib.2020.105775

Abstract

Recent trends in voicebot application development have enabled the use of both speech-to-text and text-to-speech (TTS) generation techniques. To generate a voice response to a given speech input, one needs a TTS engine. Recently developed TTS engines are shifting towards end-to-end approaches that use models such as Tacotron, Tacotron-2, WaveNet, and WaveGlow. This shift lets a TTS service provider focus on developing training and validation datasets comprising labelled texts and recorded speech, instead of designing an entirely new model that outperforms the others, which is time-consuming and costly. In this context, this work introduces the first Vietnamese FPT Open Speech Data (FOSD)-Tacotron-2-based TTS model dataset. The dataset comprises a configuration file in *.json format; training and validation text input files in *.csv format; a 225,000-step checkpoint of the trained model; and several sample generated audio files. The published dataset is valuable as a reference model for benchmarking against other newly developed TTS models and engines. In addition, it opens a new TTS research optimization problem: how to effectively generate speech from text given a black-box trained TTS model and its training and validation input texts.
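As an illustration of how the listed components might be inspected after download, the following minimal Python sketch loads a configuration file and counts the rows in the training and validation text files. The file names (config.json, train.csv, val.csv), the "|" field delimiter, and the assumption that the configuration is plain JSON without comments are all hypothetical and are not stated in the dataset description; they should be adjusted to match the published files.

```python
import csv
import json
from pathlib import Path


def load_config(path):
    """Read the model configuration (assumed to be plain JSON without comments)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_transcripts(path, delimiter="|"):
    """Read one text input file into a list of rows (each row is a list of fields).

    The delimiter and column layout are assumptions, not taken from the dataset.
    """
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]  # skip blank lines


def summarize(dataset_dir, config_name="config.json",
              train_name="train.csv", val_name="val.csv"):
    """Return the configuration keys and row counts for a dataset folder.

    All three file names are hypothetical placeholders.
    """
    root = Path(dataset_dir)
    config = load_config(root / config_name)
    train_rows = load_transcripts(root / train_name)
    val_rows = load_transcripts(root / val_name)
    return {
        "config_keys": sorted(config),
        "train_rows": len(train_rows),
        "val_rows": len(val_rows),
    }
```

Calling summarize("path/to/dataset") on a local copy returns a small dictionary that can be used to check that the text inputs were read with the expected delimiter before attempting to load the checkpoint.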

Highlights

Recent trends in voicebot application development have enabled the use of both speech-to-text and text-to-speech (TTS) generation techniques
End-to-end TTS approaches let a service provider focus on developing training and validation datasets comprising labelled texts and recorded speech instead of designing an entirely new model that outperforms the others, which is time-consuming and costly
The model was trained using the Mozilla TTS repository available at [1] and a subset of the more than 25,000 sentences in the FPT Open Speech Data available at [2]

Summary

Introduction

Recent trends in voicebot application development have enabled the use of both speech-to-text and text-to-speech (TTS) generation techniques. Recently developed TTS engines are shifting towards end-to-end approaches, which let a TTS service provider focus on developing training and validation datasets comprising labelled texts and recorded speech instead of designing an entirely new model that outperforms the others, which is time-consuming and costly.

Results

Conclusion