Abstract

Machine learning and digital signal processing have been extensively used to enhance speech. However, methods to reduce early reflections in studio settings are usually related to the physical characteristics of the room. In this paper, we address the problem of early acoustic reflections in television studios and control rooms, and propose a two-stage method that exploits the knowledge of a pretrained speech synthesis generator. First, given a degraded speech signal that includes the direct sound and early reflections, a U-Net convolutional neural network is used to attenuate the early reflections in the spectral domain. Then, a pretrained speech synthesis generator reconstructs the phase to predict an enhanced speech signal in the time domain. Qualitative and quantitative experimental results demonstrate excellent studio quality of speech enhancement.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call