Abstract

Text-to-speech voices created from noisy and reverberant recordings are of lower quality. A simple way to address this is to improve the quality of the recordings before text-to-speech training, using speech enhancement methods such as noise suppression and dereverberation. In this paper, we take this approach and use a recurrent neural network to perform the enhancement. The network is trained on parallel clean and lower quality recordings of speech. The lower quality data was created artificially by adding recordings of environmental noise to studio-quality speech recordings and by convolving the clean recordings with room impulse responses. We trained separate networks on noise-only, reverberation-only, and combined reverberation and additive noise data. In all cases, voices trained on lower quality data that had been enhanced with these networks were of significantly higher quality than voices trained on the unenhanced data. For the noise-only case, the enhanced synthetic voice was rated as highly as the voice trained on clean data. For the most realistic and challenging scenario, in which both noise and reverberation were present, the improvements were more modest but still significant.
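
As a concrete illustration of the data degradation described in the abstract, the sketch below convolves a clean studio recording with a room impulse response and then adds environmental noise at a target signal-to-noise ratio, yielding one clean/degraded training pair. This is a minimal example assuming NumPy and SciPy at a shared sample rate; the function name, the SNR parameter, and the mixing details are illustrative assumptions, not the exact procedure used in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(clean, noise, rir, snr_db=10.0):
    """Create a degraded copy of a clean utterance (hypothetical helper).

    clean, noise, rir: 1-D float arrays at the same sample rate.
    snr_db: desired signal-to-noise ratio of the mixture in dB.
    """
    # Simulate reverberation: convolve the clean speech with the room
    # impulse response and trim back to the original length.
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]

    # Loop or trim the noise recording so it covers the whole utterance.
    if len(noise) < len(reverberant):
        reps = int(np.ceil(len(reverberant) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(reverberant)]

    # Scale the noise so the mixture reaches the requested SNR (in dB)
    # relative to the reverberant speech, then mix.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```

The clean array and the returned degraded array then form one parallel training example for the enhancement network; noise-only or reverberation-only variants follow by skipping the corresponding step.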
