Abstract
The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency dependent amplification factors, which are then applied to the input raw speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which results in notable improvements in performance and robustness. Our system can not only work in non-real-time mode for offline audio playback but also support practical real-time speech applications. Experimental results using both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
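As an illustration of the processing described above, the following is a minimal sketch (not the authors' implementation) of applying time-frequency dependent amplification factors to a speech signal and then rescaling the output so that signal power is unchanged, i.e., the equal-power constraint. The STFT settings, function name, and the assumption that the gains come from an external model are illustrative.

```python
# Minimal sketch (not the authors' implementation): apply time-frequency
# dependent gains to a speech waveform and rescale so that signal power
# before and after modification is identical (equal-power constraint).
# The STFT settings and the source of `gains` are illustrative assumptions.
import numpy as np
import librosa


def apply_equal_power_gains(speech, gains, n_fft=512, hop=128):
    """Apply TF amplification factors, then renormalize to equal power."""
    spec = librosa.stft(speech, n_fft=n_fft, hop_length=hop)
    modified = librosa.istft(gains * spec, hop_length=hop, length=len(speech))
    # Rescale so that sum(modified**2) == sum(speech**2).
    scale = np.sqrt(np.sum(speech ** 2) / (np.sum(modified ** 2) + 1e-12))
    return scale * modified
```

In the proposed system the gains would be produced by the GAN generator; in this sketch they are simply an input array with the same shape as the STFT.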
Highlights
Real-life speech communication, such as mobile telephony and public-address announcement, usually occurs in noisy and reverberant environments
Although reverberation was disregarded during training, we examined whether the proposed system can work well in reverberant environments
It is worth noting that all the sentences, noises, reverberations, and signal-to-noise ratio (SNR) levels of the test set were unseen during model training
Summary
Real-life speech communication, such as mobile telephony and public-address announcement, usually occurs in noisy and reverberant environments. One algorithm, SSDRC [7], empirically sharpens the formant information and reduces the envelope variations of a speech signal, which leads to significant intelligibility improvement. Another example is a method called ASE [10], which maximizes intelligibility through certain audio manipulations, such as frequency-band decomposition and dynamic range compression, on the basis of sound-engineering knowledge. Some algorithms (e.g., [5] and [8]) were proposed to maximize the speech intelligibility index (SII) [19]. Another group of methods [6], [9], [20] optimizes a glimpse-based intelligibility metric [21].
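To make the class of manipulations mentioned above concrete, the sketch below shows a generic frequency-band decomposition followed by per-band dynamic range compression. It is only an illustration of this kind of processing, not the ASE or SSDRC algorithm; the band edges, filter order, smoothing window, compression exponent, and function names are all assumptions.

```python
# Illustrative multi-band dynamic range compression (not ASE or SSDRC):
# split speech into a few frequency bands, compress each band's envelope,
# and recombine while keeping overall signal power unchanged.
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(x, lo, hi, fs):
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)


def compress_band(band, exponent=0.5, eps=1e-8):
    # Envelope-based compression: attenuate loud segments, boost quiet ones.
    env = np.abs(band) + eps
    smooth = np.convolve(env, np.ones(256) / 256, mode="same")
    return band * (smooth ** (exponent - 1.0))


def multiband_drc(speech, fs=16000, edges=(100, 1000, 4000, 7900)):
    bands = [bandpass(speech, lo, hi, fs) for lo, hi in zip(edges[:-1], edges[1:])]
    out = sum(compress_band(b) for b in bands)
    # Keep overall power unchanged, mirroring the equal-power constraint.
    return out * np.sqrt(np.sum(speech ** 2) / (np.sum(out ** 2) + 1e-12))
```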