Abstract

The intelligibility of speech degrades severely in the presence of environmental noise and reverberation. In this paper, we propose a novel deep-learning-based system that modifies a speech signal to increase its intelligibility under an equal-power constraint, i.e., the signal power must be the same before and after modification. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency-dependent amplification factors, which are then applied to the raw input speech to reallocate its energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, covering both intelligibility and quality, which results in notable improvements in performance and robustness. Our system not only works in non-real-time mode for offline audio playback but also supports practical real-time speech applications. Experimental results from both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
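To make the equal-power constraint concrete, the sketch below applies a matrix of time-frequency gains to a waveform via the STFT and then rescales the output to the input signal power. This is a minimal illustration, not the paper's implementation: the gain matrix is assumed to be supplied externally (in the proposed system it would be predicted by the GAN), and the STFT settings and helper name are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_gains_equal_power(speech, gains, fs=16000, nperseg=512):
    """Apply time-frequency amplification factors to a speech waveform and
    rescale the result so that output power equals input power.

    `gains` is assumed to match the STFT shape (nperseg//2 + 1 bins x frames);
    here it is simply an externally supplied array of amplification factors.
    """
    speech = np.asarray(speech, dtype=float)
    _, _, spec = stft(speech, fs=fs, nperseg=nperseg)
    modified = spec * gains                      # reallocate energy across T-F bins
    _, out = istft(modified, fs=fs, nperseg=nperseg)
    out = out[:len(speech)]

    # Equal-power constraint: rescale the modified signal to the input power.
    scale = np.sqrt(np.sum(speech ** 2) / (np.sum(out ** 2) + 1e-12))
    return out * scale
```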

Highlights

  • Real-life speech communication, such as mobile telephony and public-address announcements, usually occurs in noisy and reverberant environments

  • Although reverberation was disregarded during training, we examined whether the proposed system works well in reverberant environments

  • It is worth noting that all sentences, noises, reverberation conditions, and signal-to-noise ratio (SNR) levels in the test set were unseen during model training

Summary

INTRODUCTION

Real-life speech communication, such as mobile telephony and public-address announcements, usually occurs in noisy and reverberant environments. One algorithm, SSDRC [7], empirically sharpens the formant information and reduces the envelope variations of a speech signal, which leads to significant intelligibility improvements. Another example is a method called ASE [10], which maximizes intelligibility through certain audio manipulations, such as frequency-band decomposition and dynamic range compression, on the basis of sound-engineering knowledge. Some algorithms (e.g., [5] and [8]) were proposed to maximize the speech intelligibility index (SII) [19], while another group of methods [6], [9], [20] optimizes a glimpse-based intelligibility metric [21].
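As a simple illustration of one of the manipulations mentioned above, the sketch below compresses the dynamic range of a speech waveform by flattening its short-term RMS envelope. This is a generic example only; the frame length and compression exponent are illustrative choices and not the settings used in SSDRC or ASE.

```python
import numpy as np

def dynamic_range_compression(speech, fs=16000, frame_ms=10, exponent=0.5):
    """Reduce envelope variation by pulling each frame's RMS toward the
    global RMS. The 10 ms frames and 0.5 exponent are illustrative values."""
    speech = np.asarray(speech, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(speech) // frame_len
    # Tail samples beyond the last full frame are dropped for brevity.
    frames = speech[:n_frames * frame_len].copy().reshape(n_frames, frame_len)

    # Short-term RMS envelope of each frame, plus the global RMS.
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True)) + 1e-12
    global_rms = np.sqrt(np.mean(speech ** 2)) + 1e-12

    # Compressive gain: loud frames are attenuated, quiet frames are boosted,
    # so the short-term envelope becomes flatter.
    gain = (global_rms / frame_rms) ** (1.0 - exponent)
    return (frames * gain).reshape(-1)
```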

SCENARIO DESCRIPTION AND PROBLEM FORMULATION
Target Speech Metrics
System Overview
Network Architectures
Data Preparation
Implementation Details
Objective Evaluations
Subjective Listening Tests
Acoustic Analysis on Enhanced Speech
Extensions to Real-Time Execution
Findings
CONCLUSION
