Abstract

This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms covering the entire frequency range audible to humans. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, these approaches have drawbacks in synthesis quality or inference speed. LPCNet was proposed as a real-time neural vocoder that runs on a mobile CPU, but its sampling frequency is still only 16 kHz. In this paper, we propose Full-band LPCNet, which synthesizes high-fidelity 48 kHz speech waveforms on a CPU by introducing simple but effective modifications to the conventional LPCNet. We then evaluate the synthesis quality using both normal speech and a singing voice. The experimental results demonstrate that the proposed Full-band LPCNet is the only neural vocoder that can synthesize high-quality 48 kHz speech waveforms while maintaining real-time capability on a CPU.

Highlights

  • Text-to-speech (TTS) and singing voice synthesis are important speech technologies for creating a more accessible society, and have long been a subject of research

  • To improve the real-time factors (RTFs) of Parallel WaveGAN and PeriodNet, a C-based implementation, as used in Full-band LPCNet, is required instead of PyTorch. These results suggest that a full-band real-time neural TTS system can be realized by combining Full-band LPCNet with acoustic models based on FastSpeech (see the RTF measurement sketch after this list).

  • The input features were extended to 50-dimensional Bark-frequency cepstral coefficients (BFCCs), and the number of model parameters was increased.
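
To make the real-time claim concrete, the sketch below shows how a real-time factor (RTF) can be measured for any vocoder: wall-clock synthesis time divided by the duration of the generated audio, with RTF < 1.0 meaning faster than real time. The dummy vocoder, the 5 ms frame shift (240 samples at 48 kHz), and the function names are illustrative assumptions, not details taken from the paper.

```python
import time
import numpy as np

SAMPLE_RATE = 48000  # full-band sampling rate targeted by Full-band LPCNet

def real_time_factor(synthesize, features, sample_rate=SAMPLE_RATE):
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1.0 means the vocoder runs faster than real time on this hardware."""
    start = time.perf_counter()
    waveform = synthesize(features)  # any vocoder callable: features -> waveform samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)

# Toy stand-in vocoder so the sketch runs; a real measurement would call the
# C implementation of Full-band LPCNet (or Parallel WaveGAN / PeriodNet) here.
def dummy_vocoder(features):
    hop = 240  # assumed 5 ms frame shift at 48 kHz, for illustration only
    return np.zeros(features.shape[0] * hop, dtype=np.float32)

bfcc = np.random.randn(1000, 50).astype(np.float32)  # 1000 frames of 50-dim features
print(f"RTF = {real_time_factor(dummy_vocoder, bfcc):.3f}")
```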

Summary

Introduction

Text-to-speech (TTS) and singing voice synthesis are important speech technologies for creating a more accessible society, and have long been a subject of research. A succession of TTS techniques using deep neural networks has been developed, and the quality of synthetic speech has improved significantly [1], [2]. Most neural TTS architectures consist of two modules: a neural acoustic model and a neural vocoder. The neural vocoder receives acoustic features from the acoustic model and generates raw speech waveforms. Neural vocoders such as the WaveNet vocoder [3] can synthesize higher-quality speech waveforms than conventional source-filter vocoders [4]–[6], and they have greatly contributed to the improvement of neural TTS.
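
As a minimal illustration of this two-module pipeline, the sketch below wires a stand-in acoustic model to a stand-in neural vocoder: the acoustic model maps text to frame-level acoustic features, and the vocoder maps those features to a raw waveform. All names, shapes, and the frame shift are assumptions for illustration; only the 50-dimensional feature size echoes the BFCC configuration mentioned in the highlights.

```python
import numpy as np

FEATURE_DIM = 50   # e.g. 50-dimensional BFCCs, as in the Full-band LPCNet highlight
SAMPLE_RATE = 48000
HOP_SIZE = 240     # assumed frame shift in samples (5 ms at 48 kHz)

def acoustic_model(text: str) -> np.ndarray:
    """Stand-in acoustic model: text -> (num_frames, FEATURE_DIM) acoustic features."""
    num_frames = 10 * len(text)  # arbitrary length for the sketch
    return np.zeros((num_frames, FEATURE_DIM), dtype=np.float32)

def neural_vocoder(features: np.ndarray) -> np.ndarray:
    """Stand-in neural vocoder: acoustic features -> raw waveform samples."""
    return np.zeros(features.shape[0] * HOP_SIZE, dtype=np.float32)

features = acoustic_model("Hello, world.")
waveform = neural_vocoder(features)
print(waveform.shape, f"{len(waveform) / SAMPLE_RATE:.2f} s of audio at {SAMPLE_RATE} Hz")
```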
