Recent years have seen notable advances in text-to-speech (TTS) technology, with researchers improving the efficiency, quality, and flexibility of speech generation through a variety of models. This paper systematically surveys end-to-end TTS models based on waveform generation, including Parallel WaveGAN, NaturalSpeech, and Multi-Band MelGAN, each of which contributes distinct techniques for improving real-time generation capability and sound quality. The paper also reviews developments in speech separation and synthesis, highlighting the application of models such as CONTENTVEC to pitch adjustment and the disentanglement of speaker information. In the area of multimodal technology, speech-to-gesture generation has likewise made important progress, leveraging multimodal information to produce natural gestures. The paper further summarizes the main datasets used in related research, such as LibriTTS, LJSpeech, and VCTK, with the aim of providing reference and guidance for future work on speech generation. Although these technologies have achieved significant gains in efficiency and versatility, the associated models remain complex and demand substantial computational resources, which limits their widespread adoption in practical scenarios.