Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-Network-Based Speech Waveform Synthesis

Takenori Yoshimura,Keiichi Tokuda,Keiichiro Oura,Kei Hashimoto,Yoshihiko Nankaku

doi:10.1109/taslp.2018.2818408

Abstract

This paper presents a mel-cepstrum-based quantization noise shaping method for improving the quality of synthetic speech generated by neural-network-based speech waveform synthesis systems. Since mel-cepstral coefficients closely match the characteristics of human auditory perception, the proposed method effectively masks the white noise introduced by the quantization typically used in neural-network-based speech waveform synthesis systems. The paper also describes a computationally efficient implementation of the proposed method using the structure of the mel-log spectrum approximation filter. Experiments using the WaveNet generative model, which is a state-of-the-art model for neural-network-based speech waveform synthesis, showed that speech quality is significantly improved by the proposed method.

Full Text