Abstract

Current augmented Kalman filter (AKF)-based speech enhancement algorithms (SEAs) utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date and, in turn, enhanced speech of higher quality and intelligibility than any current Kalman filter (KF) or AKF-based SEA. To this end, we investigate the MHANet within the DeepLPC framework, a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, achieved by avoiding whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR), and is compared to five existing deep learning-based methods. Among these, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method), and its enhanced speech was the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech of higher quality and intelligibility than any previous KF or AKF-based method.

Highlights

  • Speech corrupted by background noise can reduce the efficiency of communication between speaker and listener

  • We investigate whether an attention-based network can produce clean speech and noise linear prediction coefficient (LPC) estimates with less bias, and obtain higher quality and intelligibility scores, than current deep learning-based Kalman filter (KF) and augmented KF (AKF) speech enhancement algorithms (SEAs)

  • For both real-world non-stationary and coloured noise conditions, the proposed method produced lower spectral distortion (SD) levels than DeepLPC-residual network (ResNet)-TCN [34], demonstrating that an attention-based network is able to produce clean speech LPC estimates with less bias


Introduction

Speech corrupted by background noise (or noisy speech) can reduce the efficiency of communication between speaker and listener. A speech enhancement algorithm (SEA) can be used to suppress the embedded background noise and increase the quality and intelligibility of noisy speech [1]. SEAs are useful in many applications where noisy speech is undesirable yet unavoidable; hearing aid devices and speech recognition systems, for example, typically rely upon SEAs for robustness. The noisy speech y(n), at discrete-time sample n, is given by y(n) = s(n) + v(n), (1) where s(n) is the clean speech and v(n) is the additive noise. The signal is processed frame-wise, where l ∈ {0, 1, …, L − 1} is the frame index with L being the total number of frames, and n ∈ {0, 1, …, N − 1}, where N is the total number of samples within each frame. The frame index is omitted from the following AKF recursive equations.
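To make the LPC-bias problem concrete, the sketch below estimates LPCs with the standard autocorrelation method and Levinson-Durbin recursion, then shows on a synthetic first-order autoregressive "speech" signal how additive noise biases the estimated coefficients toward zero (a whiter spectrum). This is a generic illustration under stated assumptions, not the paper's DeepLPC-MHANet method; the `lpc` function and the AR(1) signal are our own constructions.

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPCs of one frame via the autocorrelation method
    and the Levinson-Durbin recursion. Returns (a, err), where
    a = [1, a_1, ..., a_order] and err is the prediction error power."""
    n = len(frame)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)]) / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(1) "clean speech": s[t] = 0.9 * s[t-1] + e[t].
rng = np.random.default_rng(0)
e = rng.standard_normal(50_000)
s = np.zeros_like(e)
for t in range(1, len(s)):
    s[t] = 0.9 * s[t - 1] + e[t]

a_clean, _ = lpc(s, 1)                       # a_clean[1] is close to -0.9
a_noisy, _ = lpc(s + rng.standard_normal(len(s)), 1)
# Additive white noise inflates r[0] but not r[1], so |a_noisy[1]| < |a_clean[1]|:
# the noisy LPC estimate is biased toward a whiter spectrum.
```

For a unit-variance AR(1) innovation with coefficient 0.9, the clean estimate sits near −0.9 while unit-variance additive noise pulls it to roughly −0.76, which is exactly the kind of bias that motivates estimating the clean speech LPCs with a network rather than from the noisy signal directly.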
