Deep learning for minimum mean-square error approaches to speech enhancement
- # Minimum Mean-square Error Approaches
- # Minimum Mean-square Error Short-time Spectral Amplitude
- # Objective Measures Of Speech Quality
- # Residual Long Short-term Memory
- # Speech Enhancement
- # Minimum Mean-square Error
- # Coloured Noise Sources
- # Mapping-based Approaches
- # Mean-square Error Approaches
- # Short-time Spectral Amplitude Estimator
70
- 10.1109/icassp.2016.7472934
- Mar 1, 2016
85568
- 10.1162/neco.1997.9.8.1735
- Nov 1, 1997
- Neural computation
27
- 10.1109/embc.2016.7590807
- Aug 1, 2016
- Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
272
- 10.1109/tassp.1980.1163353
- Feb 1, 1980
- IEEE Transactions on Acoustics, Speech, and Signal Processing
1284
- 10.1109/taslp.2014.2364452
- Jan 1, 2015
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
3385
- 10.1109/tassp.1985.1164550
- Apr 1, 1985
- IEEE Transactions on Acoustics, Speech, and Signal Processing
707
- 10.1109/5.237532
- Jan 1, 1993
- Proceedings of the IEEE
74
- 10.1109/icassp.2018.8462593
- Apr 1, 2018
60
- 10.1016/j.specom.2011.02.001
- Feb 16, 2011
- Speech Communication
4508
- 10.1109/icassp.2015.7178964
- Apr 1, 2015
- Conference Article
- 10.1109/icassp43922.2022.9746267
- May 23, 2022
Perceptual evaluation of speech quality (PESQ) is widely accepted as an effective objective metric closely related to the speech quality sensed by human listening perception. Due to its evaluation complexity and non-differentiability, PESQ is difficult to include in the cost function for deep learning-based speech enhancement. In this paper, we focus on introducing PESQ to improve Deep Xi, a recently proposed minimum mean square error (MMSE) based speech enhancement method with the a priori signal-to-noise ratio (SNR) estimated by a deep neural network. Regarding discrete a priori SNR values as actions, we apply reinforcement learning (RL) to select the optimal SNR at the frame level through a reward function associated with PESQ. The experimental results show that the RL-trained network is able to achieve a better PESQ score, especially in low SNR conditions.
- Conference Article
- 10.1109/3ict64318.2024.10824570
- Nov 17, 2024
Gender-Specific Speech Enhancement Architecture for Improving Deep Neural Networks Learning
- Conference Article
11
- 10.21437/interspeech.2020-1551
- Oct 25, 2020
The existing Kalman filter (KF) suffers from poor estimates of the noise variance and the linear prediction coefficients (LPCs) in real-world noise conditions. This results in a degraded speech enhancement performance. In this paper, a deep learning approach is used to more accurately estimate the noise variance and LPCs, enabling the KF to enhance speech in various noise conditions. Specifically, a deep learning approach to MMSE-based noise power spectral density (PSD) estimation, called DeepMMSE, is used. The estimated noise PSD is used to compute the noise variance. We also construct a whitening filter with its coefficients computed from the estimated noise PSD. It is then applied to the noisy speech, yielding pre-whitened speech for computing the LPCs. The improved noise variance and LPC estimates enable the KF to minimise the residual noise and distortion in the enhanced speech. Experimental results show that the proposed method exhibits higher quality and intelligibility in the enhanced speech than the benchmark methods in various noise conditions for a wide range of SNR levels.
- Conference Article
7
- 10.1109/csde50874.2020.9411566
- Dec 16, 2020
Currently, deep learning approaches to speech enhancement are most commonly used as front-ends for robust automatic speech recognition (ASR). A recently proposed deep learning approach to a priori SNR estimation, called Deep Xi, was able to produce enhanced speech at a higher quality and intelligibility than recent deep learning approaches to speech enhancement. Motivated by this, we investigate Deep Xi as a front-end for robust ASR. Deep Xi is evaluated using real-world non-stationary and coloured noise sources at multiple SNR levels. Our experimental investigation shows that Deep Xi as a front-end is able to produce a lower word error rate than recent deep learning approaches to speech enhancement. The results presented in this work show that Deep Xi is a viable front-end, and is able to significantly increase the robustness of an ASR system. Availability: Deep Xi is available at https://github.com/anicolson/DeepXi.
- Research Article
4
- 10.1007/s11042-022-12632-6
- Mar 9, 2022
- Multimedia Tools and Applications
Speech enhancement using U-nets with wide-context units
- Research Article
14
- 10.1186/s13634-021-00813-8
- Oct 24, 2021
- EURASIP Journal on Advances in Signal Processing
In practice, speech is easily corrupted by the surrounding environment, which results in the loss of important features. Deep learning has become a popular speech enhancement method because of its superior potential in solving nonlinear mapping problems for complex features. However, traditional deep learning methods are weak at learning important information from previous time steps and long-term dependencies in time-series data. To overcome this problem, we propose a novel speech enhancement method based on the fused features of deep neural networks (DNNs) and a gated recurrent unit (GRU). The proposed method uses the GRU to reduce the number of parameters of the DNNs and to acquire the context information of the speech, which improves the enhanced speech quality and intelligibility. First, a DNN with multiple hidden layers is used to learn the mapping between the logarithmic power spectrum (LPS) features of noisy speech and clean speech. Second, the LPS feature of the deep neural network is fused with the noisy speech as the input of the GRU network to compensate for the missing context information. Finally, the GRU network is used to learn the mapping between these LPS features and the LPS features of clean speech. The proposed model is experimentally compared with traditional speech enhancement models, including DNN, CNN, LSTM and GRU. Experimental results demonstrate that the PESQ, SSNR and STOI of the proposed algorithm are improved by 30.72%, 39.84% and 5.53%, respectively, compared with the noisy signal under matched noise conditions. Under unmatched noise conditions, the PESQ and STOI of the algorithm are improved by 23.8% and 37.36%, respectively. The advantage of the proposed method is that it uses the key information of the features to suppress noise in both matched and unmatched noise cases. The proposed method also outperforms other common speech enhancement methods.
- Research Article
38
- 10.1016/j.ecoinf.2024.102514
- Feb 13, 2024
- Ecological Informatics
This study assessed water quality (WQ) in Tongi Canal, an ecologically critical and economically important urban canal in Bangladesh. The researchers employed the Root Mean Square Water Quality Index (RMS-WQI) model, utilizing seven WQ indicators, including temperature, dissolved oxygen, electrical conductivity, lead, cadmium, and iron to calculate the water quality index (WQI) score. The results showed that most of the water sampling locations showed poor WQ, with many indicators violating Bangladesh's environmental conservation regulations. This study employed eight machine learning algorithms, where the Gaussian process regression (GPR) model demonstrated superior performance (training RMSE = 1.77, testing RMSE = 0.0006) in predicting WQI scores. To validate the GPR model's performance, several performance measures, including the coefficient of determination (R²), the Nash-Sutcliffe efficiency (NSE), the model efficiency factor (MEF), Z statistics, and Taylor diagram analysis, were employed. The GPR model exhibited higher sensitivity (R² = 1.0) and efficiency (NSE = 1.0, MEF = 0.0) in predicting WQ. The analysis of model uncertainty (standard uncertainty = 7.08 ± 0.9025; expanded uncertainty = 7.08 ± 1.846) indicates that the RMS-WQI model holds potential for assessing the WQ of inland waterbodies. These findings indicate that the RMS-WQI model could be an effective approach for assessing inland waters across Bangladesh. The study's results showed that most of the WQ indicators did not meet the recommended guidelines, indicating that the water in the Tongi Canal is unsafe and unsuitable for various purposes. The study's implications extend beyond the Tongi Canal and could contribute to WQ management initiatives across Bangladesh.
- Research Article
13
- 10.1109/access.2021.3075209
- Jan 1, 2021
- IEEE Access
Current deep learning approaches to linear prediction coefficient (LPC) estimation for the augmented Kalman filter (AKF) produce biased estimates, due to the use of a whitening filter. This severely degrades the perceived quality and intelligibility of enhanced speech produced by the AKF. In this paper, we propose a deep learning framework that produces clean speech and noise LPC estimates with significantly less bias than previous methods, by avoiding the use of a whitening filter. The proposed framework, called DeepLPC, jointly estimates the clean speech and noise LPC power spectra. The estimated clean speech and noise LPC power spectra are passed through the inverse Fourier transform to form autocorrelation matrices, which are then solved by the Levinson-Durbin recursion to form the LPCs and prediction error variances of the speech and noise for the AKF. The performance of DeepLPC is evaluated on the NOIZEUS and DEMAND Voice Bank datasets using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC is compared to six existing deep learning-based methods. Compared to other deep learning approaches to clean speech LPC estimation, DeepLPC produces a lower spectral distortion (SD) level than existing methods, confirming that it exhibits less bias. DeepLPC also produced higher objective scores than any of the competing methods (with an improvement of 0.11 for CSIG, 0.15 for CBAK, 0.14 for COVL, 0.13 for PESQ, 2.66% for STOI, 1.11 dB for SegSNR, and 1.05 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC was also the most preferred by 10 listeners. By producing less biased clean speech and noise LPC estimates, DeepLPC enables the AKF to produce enhanced speech at a higher quality and intelligibility.
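The last step described in this abstract, solving for LPCs and a prediction error variance from autocorrelation values with the Levinson-Durbin recursion, can be sketched as follows. This is a minimal pure-Python illustration of the textbook recursion, not the DeepLPC implementation. The function name `levinson_durbin`, the sign convention A(z) = 1 + a[1]z^-1 + ... and the AR(1) test sequence are assumptions made for the example.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations from autocorrelation lags r[0..order].

    Returns (a, err), where a[0] == 1.0 and a[1:] are the coefficients of the
    prediction error filter A(z) = 1 + a[1] z^-1 + ... (assumed convention),
    and err is the final prediction error variance.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k  # error variance shrinks at each order
    return a, err


# Hypothetical check: autocorrelation of x[n] = 0.5 x[n-1] + e[n], var(e) = 1.
r = [4.0 / 3.0, 2.0 / 3.0, 1.0 / 3.0]
coeffs, variance = levinson_durbin(r, 2)
```

For this AR(1) sequence the recursion should recover a coefficient close to -0.5 at lag 1, a value close to zero at lag 2, and an error variance close to 1.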
- Research Article
6
- 10.1121/10.0004823
- May 1, 2021
- The Journal of the Acoustical Society of America
Estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three categories: computational auditory scene analysis (CASA), MS, and minimum mean square error (MMSE) estimator training targets. The choice of the training target can have a significant impact on speech enhancement/separation and robust ASR performance. Motivated by this, the training target that produces enhanced/separated speech at the highest quality and intelligibility and that which is best for an ASR front-end is found. Three different deep neural network (DNN) types and two datasets, which include real-world nonstationary and coloured noise sources at multiple signal-to-noise ratio (SNR) levels, were used for evaluation. Ten objective measures were employed, including the word error rate of the Deep Speech ASR system. It is found that training targets that estimate the a priori SNR for MMSE estimators produce the highest objective quality scores. Moreover, it is established that the gain of MMSE estimators and the ideal amplitude mask produce the highest objective intelligibility scores and are most suitable for an ASR front-end.
- Book Chapter
- 10.1007/978-981-99-4742-3_28
- Jan 1, 2023
Improving the Accuracy of Deep Learning Modelling Based on Statistical Calculation of Mathematical Equations
- Research Article
5
- 10.1121/10.0002113
- Oct 1, 2020
- The Journal of the Acoustical Society of America
Minimum mean-square error (MMSE) approaches to speech enhancement are widely used in the literature. The quality of enhanced speech produced by an MMSE approach is directly impacted by the accuracy of the employed a priori signal-to-noise ratio (SNR) estimator. In this paper, the a priori SNR estimate spectral distortion (SD) level that results in a just-noticeable difference (JND) in the perceived quality of MMSE approach enhanced speech is found. The JND SD level is indicative of the accuracy that an a priori SNR estimator must exceed to have no impact on the perceived quality of MMSE approach enhanced speech. To measure the JND SD level, listening tests are conducted across five SNR levels, five noise sources, and two MMSE approaches [the MMSE short-time spectral amplitude (MMSE-STSA) estimator and the Wiener filter]. A statistical analysis of the results indicates that the JND SD level increases with the SNR level, is higher for the MMSE-STSA estimator, and is not impacted by the type of background noise. Following the literature, a significant improvement in a priori SNR estimation accuracy is required to reach the JND SD level.
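The dependence of an MMSE approach on the a priori SNR estimate can be illustrated with a short sketch. This is not the authors' code. It assumes the standard Wiener filter gain G = ξ/(1 + ξ), and it assumes that spectral distortion is measured as the root-mean-square difference, in dB, between the true and estimated a priori SNR values. The function names are hypothetical.

```python
import math


def wiener_gain(xi):
    """Wiener filter gain for a linear-scale a priori SNR value xi."""
    return xi / (1.0 + xi)


def spectral_distortion_db(xi_true, xi_est):
    """RMS difference in dB between true and estimated a priori SNR values.

    Assumed definition for illustration; both inputs are sequences of
    positive, linear-scale a priori SNR values of equal length.
    """
    squared = [
        (10.0 * math.log10(t) - 10.0 * math.log10(e)) ** 2
        for t, e in zip(xi_true, xi_est)
    ]
    return math.sqrt(sum(squared) / len(squared))


# An a priori SNR of 0 dB (xi = 1) gives a gain of 0.5.
gain_at_0_db = wiener_gain(1.0)

# An estimate that is 10 dB too low in every bin gives an SD of 10 dB.
sd = spectral_distortion_db([10.0, 100.0], [1.0, 10.0])
```

Because the gain is a direct function of ξ, any error in the a priori SNR estimate changes the gain applied to each frequency bin, which is why the estimator's accuracy affects the enhanced speech.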
- Dissertation
- 10.25904/1912/4020
- Dec 4, 2020
Speech corrupted by background noise (or noisy speech) can cause misinterpretation and fatigue during phone and conference calls, and for hearing aid users. Noisy speech can also severely impact the performance of speech processing systems such as automatic speech recognition (ASR), automatic speaker verification (ASV), and automatic speaker identification (ASI) systems. Currently, deep learning approaches are employed in an end-to-end fashion to improve robustness. The target speech (or clean speech) is used as the training target or large noisy speech datasets are used to facilitate multi-condition training. In this dissertation, we propose competitive alternatives to the preceding approaches by updating two classic robust speech processing techniques using deep learning. The two techniques include minimum mean-square error (MMSE) and missing data approaches. An MMSE estimator aims to improve the perceived quality and intelligibility of noisy speech. This is accomplished by suppressing any background noise without distorting the speech. Prior to the introduction of deep learning, MMSE estimators were the standard speech enhancement approach. MMSE estimators require the accurate estimation of the a priori signal-to-noise ratio (SNR) to attain a high level of speech enhancement performance. However, current methods produce a priori SNR estimates with a large tracking delay and a considerable amount of bias. Hence, we propose a deep learning approach to a priori SNR estimation that is significantly more accurate than previous estimators, called Deep Xi. Through objective and subjective testing across multiple conditions, such as real-world non-stationary and coloured noise sources at multiple SNR levels, we show that Deep Xi allows MMSE estimators to produce the highest quality enhanced speech amongst all clean speech magnitude spectrum estimators. 
Missing data approaches improve robustness by performing inference only on noisy speech features that reliably represent clean speech. In particular, the marginalisation method was able to significantly increase the robustness of Gaussian mixture model (GMM)-based speech classification systems (e.g. GMM-based ASR, ASV, or ASI systems) in the early 2000s. However, deep neural networks (DNNs) used in current speech classification systems are non-probabilistic, a requirement for marginalisation. Hence, multi-condition training or noisy speech pre-processing is used to increase the robustness of DNN-based speech classification systems. Recently, sum-product networks (SPNs) were proposed, which are deep probabilistic graphical models that can perform the probabilistic queries required for missing data approaches. While available toolkits for SPNs are in their infancy, we show through an ASI task that SPNs using missing data approaches could be a strong alternative for robust speech processing in the future. This dissertation demonstrates that MMSE estimators and missing data approaches are still relevant approaches to robust speech processing when assisted by deep learning.
- Conference Article
6
- 10.1109/acssc.1995.540554
- Oct 30, 1995
An interference cancellation scheme for CDMA communication systems is presented. The algorithm is based on constrained optimization where receiver output power is minimized subject to the constraint that a desired code be passed with no distortion. Output power minimization (OPM) schemes have been shown to be equivalent to minimum mean square error (MMSE) approaches. Like MMSE approaches, OPM schemes do not require explicit knowledge of the interference structure. Unlike most MMSE approaches, however, OPM schemes do not require training. The algorithms are blind in the sense that only knowledge of the desired user's code (not the actual transmitted bit sequence) and associated timing is necessary. Moreover, the algorithms are linear and, therefore, not susceptible to misconvergence. The particular OPM implementation presented is based on the generalized sidelobe canceller structure which was originally developed for adaptive cancellation of spatial interference in array signal processing applications. Algorithm performance results are presented in the form of signal to interference ratios and bit error rate curves.
- Research Article
48
- 10.1016/j.apacoust.2020.107647
- Sep 11, 2020
- Applied Acoustics
LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement
- Conference Article
2
- 10.1109/icspcc52875.2021.9564668
- Aug 17, 2021
Owing to the development of deep learning, state-of-the-art masking- and mapping-based speech enhancement methods have attracted increasing attention. However, traditional speech enhancement approaches, such as the minimum mean-square error (MMSE) approach and the Wiener filter (WF), have not been fully investigated. To better characterize them, we propose a deep learning based MMSE approach for single-channel speech enhancement based on non-negative matrix factorization (NMF). The performance of the MMSE approach can be improved through the a priori signal-to-noise ratio (SNR). Therefore, we utilize an NMF-based Densely Connected Convolutional Network (DenseNet) as an estimator of the a priori SNR. In the test stage, speech at multiple SNR levels from colored noise sources and real-world non-stationary noise sources was used for evaluation. As expected, the proposed method outperformed many previous speech enhancement methods.
- Research Article
12
- 10.1016/j.specom.2014.12.002
- Dec 9, 2014
- Speech Communication
Speech enhancement based on β-order MMSE estimation of Short Time Spectral Amplitude and Laplacian speech modeling
- Conference Article
1
- 10.1117/12.863504
- Jul 11, 2010
An important issue in Wyner-Ziv video coding is the reconstruction of Wyner-Ziv frames with decoded bit-planes. So far, there are two major approaches: the Maximum a Posteriori (MAP) reconstruction and the Minimum Mean Square Error (MMSE) reconstruction algorithms. However, these approaches do not exploit smoothness constraints in natural images. In this paper, we model a Wyner-Ziv frame by Markov random fields (MRFs), and produce reconstruction results by finding an MAP estimation of the MRF model. In the MRF model, the energy function consists of two terms: a data term, MSE distortion metric in this paper, measuring the statistical correlation between side-information and the source, and a smoothness term enforcing spatial coherence. In order to better describe the spatial constraints of images, we propose a context-adaptive smoothness term by analyzing the correspondence between the output of Slepian-Wolf decoding and successive frames available at decoders. The significance of the smoothness term varies in accordance with the spatial variation within different regions. To some extent, the proposed approach is an extension to the MAP and MMSE approaches by exploiting the intrinsic smoothness characteristic of natural images. Experimental results demonstrate a considerable performance gain compared with the MAP and MMSE approaches.
- Conference Article
3
- 10.1109/wcsp.2013.6677271
- Oct 1, 2013
This paper provides a new approach to solve the handover problem in high mobility cellular communication networks. In particular, the mobile node, e.g., the high-speed train (HST), is equipped with multiple antennas by which it finds the target base station (BS) quickly with only receive beamforming. When only imperfect channel state information (CSI) is available, the output signal-to-noise ratio (SNR) of the receive beamformer will be interfered by signals from different nearby BSs, which may lead to possible error in the selection of target BS. To deal with this, two methods, i.e., maximum-ratio combining (MRC) and Minimum Mean Square Error (MMSE), are proposed to obtain the optimal beamforming vectors. Simulation results show that both the proposed methods perform well, and the MMSE approach has good robustness.
- Conference Article
5
- 10.1109/hscma.2014.6843264
- May 1, 2014
In this paper, we provide a theoretical analysis of the minimum mean-square error short-time spectral amplitude (MMSE-STSA) estimator with the biased a priori SNR estimation and its extension to musical-noise-free speech enhancement. Recently, musical-noise-free speech enhancement has been proposed, where no musical noise is generated in iterative spectral subtraction. However, no existence of the musical-noise-free condition in the MMSE-STSA estimator has been reported. Therefore, in this paper, we show that the musical-noise-free condition exists in the biased MMSE-STSA estimator via the theoretical analysis. In addition, we perform comparative experiments and clarify the efficacy of the proposed musical-noise-free speech enhancement.
- Conference Article
2
- 10.1109/wocn.2013.6616178
- Jul 1, 2013
In this paper, a Dirty Paper Coding (DPC) based Minimum Mean Square Error (MMSE) beamforming design for a MIMO interference channel has been introduced. It includes signal leakage and linear transmit filters. At each transmitter, MMSE approach has been used comprising the signal with the interference. The simulation results show that the proposed design achieves significant reduction in mean square error (MSE) value as compared to leakage based MMSE approach. An optimal DPC strategy has been used for cancelling causal interference.
- Conference Article
7
- 10.1109/icassp.2005.1415749
- Mar 18, 2005
We present a low-complexity equalizer for multicarrier code-division multiple-access (MC-CDMA) downlink systems over time-varying (TV) multipath channels with non-negligible Doppler spread. The equalization algorithm, which is based on a block minimum mean-squared error (MMSE) approach, exploits the band structure of the frequency-domain channel matrix by means of a band LDL^H factorization. The complexity of the proposed block MMSE equalizer is linear in the number of subcarriers, and smaller with respect to a serial MMSE equalizer characterized by a similar performance.
- Research Article
21
- 10.1109/tsp.2003.810288
- May 1, 2003
- IEEE Transactions on Signal Processing
For the code division multiple access (CDMA) downlink channel, the chip-level equalization has been considered in this paper. There is no interference after despreading if all spreading codes are orthogonal, as in IS-95. However, it cannot be true for a frequency-selective fading channel. In this case, the chip-level equalization can be applied to restore the orthogonality. We investigate the chip-level equalization using finite impulse response (FIR) equalizers for the mobile station with multiple receive antennas. A blind approach and the minimum mean square error (MMSE) approach with code-multiplexed pilot are considered. A generalized MMSE equalization, which combines the MMSE and blind approaches together, is also investigated. It is shown that the generalized MMSE equalizer can effectively increase the number of samples to track the variation of channel and thereby performs better when the coherence time is small. In addition, we derive closed-form solutions of the blind, MMSE, and generalized MMSE equalizers for given channels.
- Research Article
2
- 10.1049/el.2012.1472
- Sep 13, 2012
- Electronics Letters
This reported work focuses on weighted sum-rate maximisation (WSRM) of cognitive radio ad-hoc networks, where a K user multiple-input multiple-output interference network (K-MIMO-IFN) uses the same spectrum with a licenced primary user. Because the WSRM problem in K-MIMO-IFN is non-convex and it is difficult to get an optimal solution directly, the weighted minimum mean square error (MMSE) approach is used to make the problem easier to handle. This report then proposes a dual-MMSE-algorithm, which iteratively finds a local optimal solution and only needs the local channel knowledge. Simulation results show that the proposed algorithm outperforms other conventional algorithms.
- Conference Article
13
- 10.1109/iscas.1993.393677
- May 3, 1993
Interpolation filtering in digital demodulators with fixed input sampling rate is treated. The procedures for optimum interpolation filter design which account for the peculiarities of the digital demodulators are presented. Minimum mean square error (MMSE) interpolation filters are discussed, and a design procedure for standard input signals is given. For some hardwired implementations the on-line computation of the filter coefficients can be advantageous. For those cases, the MMSE approach is almost infeasible and polynomial approximations are used. New design algorithms for coefficient computation for optimum polynomial interpolators are given. The design procedure is illustrated by a number of examples, and the results are compared to the nonpolynomial MMSE case.
- Conference Article
2
- 10.1109/icdsp.2015.7252070
- Jul 1, 2015
In this paper, we address theoretical studies on the existence of musical-noise-free conditions for statistical-model-based speech enhancement methods. Recently, musical-noise-free speech enhancement has been proposed, where no musical noise is generated in iterative spectral subtraction, iterative Wiener filtering, and the minimum mean-square error short-time spectral amplitude (MMSE-STSA) estimator. As an extension of this theory to more flexible speech enhancement algorithms, in this paper, we reveal that the musical-noise-free condition exists in the methods with the a priori statistical speech models, e.g., the biased generalized MMSE-STSA estimator, via higher-order-statistics analysis. In addition, we perform comparative experiments and clarify the efficacy of the proposed musical-noise-free speech enhancement.
- New
- Research Article
- 10.1016/j.specom.2025.103314
- Nov 1, 2025
- Speech Communication
- New
- Research Article
- 10.1016/j.specom.2025.103317
- Nov 1, 2025
- Speech Communication
- New
- Research Article
- 10.1016/j.specom.2025.103328
- Nov 1, 2025
- Speech Communication
- New
- Research Article
- 10.1016/j.specom.2025.103313
- Nov 1, 2025
- Speech Communication
- New
- Research Article
- 10.1016/j.specom.2025.103327
- Nov 1, 2025
- Speech Communication
- New
- Research Article
- 10.1016/j.specom.2025.103305
- Nov 1, 2025
- Speech Communication
- Research Article
- 10.1016/j.specom.2025.103316
- Oct 1, 2025
- Speech Communication
- Research Article
- 10.1016/j.specom.2025.103323
- Oct 1, 2025
- Speech Communication
- Research Article
- 10.1016/j.specom.2025.103319
- Oct 1, 2025
- Speech Communication
- Research Article
- 10.1016/j.specom.2025.103315
- Oct 1, 2025
- Speech Communication