Abstract
We propose two novel techniques, stacking bottleneck features and a minimum generation error training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related limitations of typical current DNN-based synthesis frameworks: frame-by-frame independence, and neglect of the relationship between static and dynamic features. Stacking bottleneck features, which are an acoustically-informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The minimum generation error training criterion minimises the overall output trajectory error across an utterance, rather than minimising the error of each frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can easily be combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than from conventional DNN or long short-term memory (LSTM) recurrent neural network (RNN) systems.
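To make the first idea concrete, here is a minimal numpy sketch of stacking bottleneck features under our own illustrative assumptions: the function name, the context width of four frames, and the `bottleneck_net` callable (a trained first-pass network whose narrow hidden layer yields the acoustically-informed representation) are assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def stack_bottleneck_features(linguistic, bottleneck_net, context=4):
    """Augment each frame's linguistic features with bottleneck features
    from +/- `context` neighbouring frames (edge frames are clamped).

    linguistic:     (T, D_ling) frame-level linguistic features, one utterance
    bottleneck_net: callable mapping (T, D_ling) -> (T, D_bn) bottleneck
                    activations from a trained first-pass DNN (assumed here)
    """
    T = linguistic.shape[0]
    bn = bottleneck_net(linguistic)  # (T, D_bn) acoustically-informed features
    stacked = []
    for t in range(T):
        # Indices of the surrounding frames, clamped at utterance boundaries.
        idx = np.clip(np.arange(t - context, t + context + 1), 0, T - 1)
        stacked.append(np.concatenate([linguistic[t], bn[idx].ravel()]))
    return np.stack(stacked)  # (T, D_ling + (2*context + 1) * D_bn)
```

The stacked output would then serve as the input to a second network, so that each frame's prediction is conditioned on a wider, acoustically-informed window of linguistic context.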
Highlights
Statistical parametric speech synthesis (SPSS) [1] has advanced rapidly in the last decade, as seen across the annual Blizzard Challenges [2], and can produce highly intelligible synthesised speech with acceptable naturalness.
Deep neural networks (DNNs) have been reported to achieve significant improvements over hidden Markov models (HMMs) for speech synthesis, as we reviewed in Section I-A, but there are at least two limitations in current DNN implementations.
We propose two techniques to improve the performance of DNN-based speech synthesis, namely stacking bottleneck features and a minimum generation error training criterion.
Summary
Statistical parametric speech synthesis (SPSS) [1] has advanced rapidly in the last decade, as seen across the annual Blizzard Challenges [2], and can produce highly intelligible synthesised speech with acceptable naturalness. Although it offers greater flexibility than the other mainstream technique of unit selection [3], the naturalness of speech generated by SPSS is still too low. We propose two novel techniques to improve its acoustic modelling. Both techniques target improved modelling of the temporal nature of speech, but in different ways: one via the input linguistic features, the other via the output speech parameters. Each of them results in improvements to the subjective naturalness of the synthesised speech, and their combination gives a further improvement.
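The second technique, the minimum generation error criterion, can likewise be sketched in a few lines. The sketch below computes a trajectory-level loss for a single feature dimension: the network's predicted static and dynamic features are converted into a smooth static trajectory via maximum likelihood parameter generation (MLPG), and the loss is measured between that generated trajectory and the natural one over the whole utterance, rather than per frame. The particular delta window coefficients and the per-component inverse variances are standard-style assumptions for illustration, not the authors' exact configuration.

```python
import numpy as np

def delta_window_matrix(T):
    """Stack identity, delta, and delta-delta windows into W of shape (3T, T).
    Assumes common windows: delta = 0.5*(c[t+1] - c[t-1]),
    delta-delta = c[t+1] - 2*c[t] + c[t-1], with clamped edges."""
    I = np.eye(T)
    D1 = np.zeros((T, T))
    D2 = np.zeros((T, T))
    for t in range(T):
        tm, tp = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, tp] += 0.5
        D1[t, tm] -= 0.5
        D2[t, tp] += 1.0
        D2[t, tm] += 1.0
        D2[t, t] -= 2.0
    return np.vstack([I, D1, D2])

def mge_trajectory_loss(pred_osd, target_static, inv_var):
    """Trajectory-level (MGE-style) loss for one feature dimension.

    pred_osd:      (3T,) predicted static + delta + delta-delta sequence
    target_static: (T,)  natural static trajectory
    inv_var:       (3T,) assumed per-component inverse variances
    """
    T = target_static.shape[0]
    W = delta_window_matrix(T)
    # MLPG: solve (W' D^-1 W) c = W' D^-1 o for the static trajectory c.
    A = W.T @ np.diag(inv_var) @ W
    c = np.linalg.solve(A, W.T @ (inv_var * pred_osd))
    # Error of the *generated* trajectory across the utterance, not per frame.
    return np.mean((c - target_static) ** 2)
```

Because the generation step couples all frames through the window matrix, minimising this loss forces the network to account for the interaction between static and dynamic features, which per-frame training ignores.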