Modulation spectrum-based speech parameter trajectory smoothing for DNN-based speech synthesis using FFT spectra

Shinnosuke Takamichi

doi:10.1109/apsipa.2017.8282234

Abstract

This paper investigates the effect of Modulation Spectrum (MS)-based preprocessing on Deep Neural Network (DNN)-based speech synthesis using Fast Fourier Transform (FFT) spectra. Preprocessing the training data is an effective way to improve the accuracy of acoustic model training. We previously proposed an MS-based preprocessing method, called speech parameter trajectory smoothing, for DNN-based synthesis using vocoder parameters, and confirmed that it improves training accuracy by removing components that the acoustic models have difficulty modeling. Meanwhile, DNN-based synthesis using FFT spectra has recently been proposed, and it has great potential for a wider range of applications. This paper investigates whether trajectory smoothing of FFT spectra is effective. The experimental evaluation demonstrates that trajectory smoothing reduces the mean squared error between the predicted and target FFT spectra.
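To make the idea of MS-based trajectory smoothing concrete, the following is a minimal sketch, not the paper's implementation. It treats each feature dimension (for example, one log-amplitude FFT bin, which is an assumption here) as a time trajectory, computes its modulation spectrum as the power spectrum of that trajectory, and smooths it by zeroing modulation-frequency components above a cutoff. The function names `modulation_spectrum` and `smooth_trajectory`, the hard low-pass filter, the 5 ms frame shift, and the 50 Hz cutoff are all hypothetical choices for illustration; the filter and settings used in the paper may differ.

```python
import numpy as np


def modulation_spectrum(traj):
    """Modulation spectrum of each feature dimension.

    traj: (T, D) array of frame-level features, e.g. (assumed) log-amplitude
    FFT bins. Returns the (T//2 + 1, D) power spectrum of every dimension's
    temporal trajectory, computed along the time axis.
    """
    traj = np.asarray(traj, dtype=float)
    return np.abs(np.fft.rfft(traj, axis=0)) ** 2


def smooth_trajectory(traj, frame_shift_s=0.005, cutoff_hz=50.0):
    """Hypothetical MS-based smoothing via a hard modulation-domain low-pass.

    Components whose modulation frequency exceeds cutoff_hz are set to zero,
    and the trajectory is resynthesized with the inverse real FFT. The frame
    shift and cutoff defaults are illustrative assumptions only.
    """
    traj = np.asarray(traj, dtype=float)
    n_frames = traj.shape[0]
    spec = np.fft.rfft(traj, axis=0)                        # (T//2+1, D) complex
    freqs = np.fft.rfftfreq(n_frames, d=frame_shift_s)      # modulation freqs in Hz
    spec[freqs > cutoff_hz, :] = 0.0                        # drop fast fluctuations
    return np.fft.irfft(spec, n=n_frames, axis=0)           # back to (T, D)


if __name__ == "__main__":
    # Two toy dimensions over 1 s (200 frames at a 5 ms shift): a slow 10 Hz
    # modulation that should survive and a fast 80 Hz one that should not.
    t = np.arange(200) * 0.005
    slow = np.sin(2 * np.pi * 10 * t)
    fast = 0.5 * np.sin(2 * np.pi * 80 * t)
    x = np.stack([1.0 + slow + fast, 2.0 + slow], axis=1)
    y = smooth_trajectory(x)
    print(np.allclose(y[:, 0], 1.0 + slow))
```

A hard cutoff keeps the sketch short; a smoother low-pass response would avoid ringing on real trajectories, and the appropriate cutoff would have to be chosen empirically for FFT spectra.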

Full Text