Deep neural networks have recently been used in several studies to improve speech intelligibility in noise. Here, we tested whether such networks could denoise speech in a predictive manner, which would be highly desirable for potential real-time applications. Training targets for the networks consisted of a mask (ideal binary mask or ideal ratio mask) for the last observed frame (non-predictive) or for one frame ahead in the future (predictive). Frame length was fixed at 48 ms. Training was performed on a target speaker with added speech-shaped noise, using about 25 min of training speech, at signal-to-noise ratios ranging from −12 to 3 dB. A behavioral experiment was run to measure the intelligibility of semantically unpredictable sentences in speech-shaped noise. For the behavioral experiment, target sentences were different from the training sentences, and novel exemplars of speech-shaped noise were drawn on each trial. We observed intelligibility gains for both networks over a broad range of signal-to-noise ratios, with a maximum gain of 13.6 percentage points for the non-predictive network compared to 9.4 percentage points for the predictive network. This shows that a network can be successfully trained to denoise a specific speaker in a predictive manner.
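The difference between the non-predictive and predictive training targets can be illustrated with a minimal sketch. It is not the authors' implementation: the mask formulas are the commonly used textbook definitions, and the function names, the `beta` exponent, the `lc_db` local criterion, the single-frame input, and the toy energy values are all assumptions made for illustration.

```python
# Illustrative sketch only: names, parameters, and toy values are hypothetical.
# Energies are lists of frames, each frame a list of per-band energies.

def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """IRM per time-frequency unit: (S / (S + N)) ** beta (beta is assumed)."""
    return [
        [(s / (s + n)) ** beta if (s + n) > 0 else 0.0
         for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_energy, noise_energy)
    ]

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds lc_db (assumed criterion), else 0."""
    threshold = 10 ** (lc_db / 10)
    return [
        [1 if s > threshold * n else 0 for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_energy, noise_energy)
    ]

def make_training_pairs(noisy_frames, mask_frames, lookahead=0):
    """Pair observed frame t with the mask of frame t + lookahead.

    lookahead=0 gives the non-predictive target (last observed frame);
    lookahead=1 gives the predictive target (one frame ahead).
    """
    last = len(noisy_frames) - lookahead
    return [(noisy_frames[t], mask_frames[t + lookahead]) for t in range(last)]

# Toy example: 3 frames x 2 frequency bands.
speech = [[9.0, 1.0], [1.0, 3.0], [4.0, 0.0]]
noise = [[1.0, 3.0], [3.0, 1.0], [4.0, 1.0]]
# Stand-in input features: summed energies (assumes uncorrelated speech and noise).
noisy = [[s + n for s, n in zip(sf, nf)] for sf, nf in zip(speech, noise)]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
non_predictive = make_training_pairs(noisy, ibm, lookahead=0)
predictive = make_training_pairs(noisy, ibm, lookahead=1)
```

In this sketch the predictive set has one fewer pair than the non-predictive set, because the final observed frame has no following frame to supply a target. A real system would likely feed the network a context window of past frames rather than a single frame.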