An In-Depth Review of Speech Enhancement Algorithms: Classifications, Underlying Principles, Challenges, and Emerging Trends
Speech enhancement aims to improve speech quality and intelligibility in noisy environments and is important in applications such as hearing aids, mobile communications, and automatic speech recognition (ASR). This paper presents a structured review of speech enhancement techniques, classified according to channel configuration and signal processing framework. Both traditional and modern approaches are discussed, including classical signal processing methods, machine learning techniques, and recent deep learning-based models. Furthermore, common noise types, widely used speech datasets, and standard metrics for evaluating speech quality and intelligibility are reviewed. Key challenges such as non-stationary noise, data limitations, reverberation, and generalization to unseen noise conditions are highlighted. This review presents the advancements in speech enhancement and discusses the challenges and trends of the field, providing valuable insights for researchers, engineers, and practitioners in the area. The findings aid in the selection of suitable techniques for improved speech quality and intelligibility, and show that the trend in speech enhancement has shifted from standard algorithms to deep learning methods that can efficiently learn information about speech signals.
- Research Article
9
- 10.1097/01.hj.0000286697.74328.32
- Apr 1, 2006
- The Hearing Journal
A likely consequence of sensorineural hearing loss is a diminished ability to understand speech in noisy backgrounds. Indeed, the inability to hear in noise is one of the main reasons for dissatisfaction with hearing aid use.[1] Although hearing aids with a directional microphone provide substantial improvement in the ability of wearers to understand speech in noisy environments,[2] space requirements and the uncertainty of directional characteristics when a directional hearing aid is inserted deeply into the ear canal prevent the implementation of such technology in the smallest completely-in-the-canal (CIC) hearing aids. A patient who insists on wearing CIC hearing aids will have to rely solely on the efficacy of single-mic noise-management strategies. Currently, most commercial “noise reduction” schemes use the modulation rates of the input signal as a basis for estimating the “speech” and “noise” nature of the input. “Speech” sounds are amplified with input-dependent gain, while “noise” sounds are typically amplified at a gain reduced beyond input-dependent levels (see Chung, 2004[3]). The exact amount of gain reduction and its time course vary greatly among manufacturers. Despite such differences, most studies have reported only an improvement in listening comfort from use of such a scheme.[4] The improvement in listening comfort is noteworthy. It suggests that hearing-impaired persons whose hearing aids have noise reduction may be less affected by high output sound pressure levels, less stressed, and less distracted in noisy situations, and may be more likely to attend to the sound sources and wear their hearing aids longer than if they lacked this feature. Wearers may also improve their participation in daily activities and their quality of life. It is possible that the reduction in stress level (or distractions) may improve the user’s speech understanding in some noisy environments.
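The modulation-rate idea described above can be sketched in a toy single-band form: speech envelopes fluctuate at syllabic rates (roughly 2-10 Hz), while steady noise has low envelope modulation depth, so low-modulation frames receive reduced gain. All frame lengths, thresholds, and gain values below are illustrative assumptions, not taken from any commercial scheme.

```python
import numpy as np

def modulation_gain(x, fs, frame_len=0.125, smooth_len=0.002,
                    depth_thresh=0.4, noise_gain_db=-10.0):
    """Toy single-band noise management: frames whose smoothed envelope
    has low modulation depth (steady, noise-like) get reduced gain.
    All parameter values are illustrative, not from any product."""
    n = int(frame_len * fs)
    k = max(int(smooth_len * fs), 1)
    gains = np.ones(len(x))
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n]
        # smoothed rectified envelope of the frame
        env = np.convolve(np.abs(frame), np.ones(k) / k, mode="valid")
        depth = (env.max() - env.min()) / (env.max() + env.min() + 1e-12)
        if depth < depth_thresh:            # low modulation -> "noise"
            gains[start:start + n] = 10 ** (noise_gain_db / 20)
    return x * gains, gains

fs = 8000
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)               # steady, noise-like
speechy = tone * (1 + 0.9 * np.sin(2 * np.pi * 4 * t))  # 4 Hz "speech-like"
_, g_noise = modulation_gain(tone, fs)
_, g_speech = modulation_gain(speechy, fs)
```

As the abstract notes, real products differ greatly in the amount of gain reduction and its time course; a practical scheme would also smooth the gain trajectory rather than switch it per frame.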
- Research Article
3
- 10.1016/j.bspc.2022.104447
- Dec 7, 2022
- Biomedical Signal Processing and Control
Comparing the performance of classic voice-driven assistive systems for dysarthric speech
- Research Article
- 10.1177/18758967251413999
- Jan 16, 2026
- Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology
A Fully Connected Deep Neural Network (FCDNN) is used for speech enhancement of Hindi speech databases contaminated by a diverse range of background noises. The database includes both stationary and nonstationary noises such as Car Noise, Factory Noise, Machine Gun Noise, and Fighter Plane Noise. These noises are added artificially to clean speech signals at varying input Signal-to-Noise Ratio (SNR) levels, i.e., −5, 0, 5, and 10 dB, to simulate real-world scenarios with different levels of noise interference. Background noises such as Machine Gun and Factory Noise are more non-stationary than Car Noise and Fighter Plane Noise. This distinction underlines the importance of evaluating speech enhancement systems under diverse noise conditions to assess their robustness in real-world applications. The proposed system demonstrates significant improvements in SNR, PESQ, and STOI for all four noises. Even with a speech signal corrupted by highly nonstationary machine gun noise at a −5 dB input SNR level, an SNR improvement of 13.94 dB with a PESQ value of 2.91 and an STOI of 0.94 is observed, which shows that the quality and intelligibility of the recovered speech are retained. These findings highlight the effectiveness of FCDNN-based approaches in removing both stationary and nonstationary background noise from corrupted speech signals. Overall, this research contributes to enhancing the quality and intelligibility of speech signals in noisy environments by leveraging the capabilities of deep learning techniques.
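Building a corpus like the one described, with noise added at prescribed input SNRs of −5, 0, 5, and 10 dB, reduces to scaling the noise by the ratio of signal powers before adding it. A minimal sketch (level conventions and signal lengths are assumptions):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`,
    then add it to `clean` (the usual way noisy corpora are built)."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in "speech"
noise = rng.standard_normal(16000)                           # stand-in noise
noisy = mix_at_snr(clean, noise, snr_db=-5)

# verify the achieved input SNR
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```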
- Research Article
7
- 10.1109/embc.2018.8512277
- Jul 1, 2018
- Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
The performance of a deep-learning-based speech enhancement (SE) technology for hearing aid users, called a deep denoising autoencoder (DDAE), was investigated. The hearing-aid speech perception index (HASPI) and the hearing-aid sound quality index (HASQI), which are two well-known evaluation metrics for speech intelligibility and quality, were used to evaluate the performance of the DDAE SE approach on two typical high-frequency hearing loss (HFHL) audiograms. Our experimental results show that the DDAE SE approach yields higher intelligibility and quality scores than two classical SE approaches. These results suggest that a deep-learning-based SE method could be used to improve speech intelligibility and quality for hearing aid users in noisy environments.
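A denoising autoencoder of the kind evaluated here learns a mapping from noisy to clean spectral features. The sketch below trains a one-hidden-layer numpy version on synthetic frames purely to illustrate the training loop; the DDAE in the paper is deeper and trained on real log-magnitude spectra, and all sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for spectral frames: "clean" targets and
# additively corrupted "noisy" inputs. Shapes only; no real audio.
n_frames, n_bins, n_hidden = 256, 64, 32
clean = rng.standard_normal((n_frames, n_bins))
noisy = clean + 0.5 * rng.standard_normal((n_frames, n_bins))

# One-hidden-layer denoising autoencoder (the paper's DDAE is deeper).
W1 = rng.standard_normal((n_bins, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_bins)) * 0.1
b2 = np.zeros(n_bins)

losses = []
lr = 0.05
for step in range(200):
    h = np.tanh(noisy @ W1 + b1)           # encoder
    out = h @ W2 + b2                      # decoder: estimate of clean
    err = out - clean
    losses.append(float(np.mean(err ** 2)))  # MSE reconstruction loss
    # backprop through the two layers
    gW2 = h.T @ err / n_frames
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)       # tanh derivative
    gW1 = noisy.T @ dh / n_frames
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

At inference time the trained network maps each noisy frame to a denoised frame, which is then converted back to a waveform.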
- Single Book
689
- 10.1201/b14529
- Feb 25, 2013
With the proliferation of mobile devices and hearing devices, including hearing aids and cochlear implants, there is a growing and pressing need to design algorithms that can improve speech intelligibility without sacrificing quality. Responding to this need, Speech Enhancement: Theory and Practice, Second Edition introduces readers to the basic problems of speech enhancement and the various algorithms proposed to solve these problems. Updated and expanded, this second edition of the bestselling textbook broadens its scope to include evaluation measures and enhancement algorithms aimed at improving speech intelligibility.
Fundamentals, Algorithms, Evaluation, and Future Steps
Organized into four parts, the book begins with a review of the fundamentals needed to understand and design better speech enhancement algorithms. The second part describes all the major enhancement algorithms and, because these require an estimate of the noise spectrum, also covers noise estimation algorithms. The third part of the book looks at the measures used to assess the performance, in terms of speech quality and intelligibility, of speech enhancement methods. It also evaluates and compares several of the algorithms. The fourth part presents binary mask algorithms for improving speech intelligibility under ideal conditions. In addition, it suggests steps that can be taken to realize the full potential of these algorithms under realistic conditions.
What’s New in This Edition
- Updates in every chapter
- A new chapter on objective speech intelligibility measures
- A new chapter on algorithms for improving speech intelligibility
- Real-world noise recordings (on downloadable resources)
- MATLAB® code for the implementation of intelligibility measures (on downloadable resources)
- MATLAB and C/C++ code for the implementation of algorithms to improve speech intelligibility (on downloadable resources)
Valuable Insights from a Pioneer in Speech Enhancement
Clear and concise, this book explores how human listeners compensate for acoustic noise in noisy environments. Written by a pioneer in speech enhancement and noise reduction in cochlear implants, it is an essential resource for anyone who wants to implement or incorporate the latest speech enhancement algorithms to improve the quality and intelligibility of speech degraded by noise.
Includes Downloadable Resources with Code and Recordings
The downloadable resources provide MATLAB implementations of representative speech enhancement algorithms as well as speech and noise databases for the evaluation of enhancement algorithms.
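The binary mask algorithms covered in the book's fourth part build on the ideal binary mask (IBM): keep a time-frequency cell when its local SNR exceeds a local criterion (LC), discard it otherwise. A minimal oracle sketch (the 0 dB criterion and toy values are illustrative):

```python
import numpy as np

def ideal_binary_mask(clean_mag, noise_mag, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR exceeds the local
    criterion (LC), 0 elsewhere. Requires oracle access to the clean
    and noise magnitudes, hence 'ideal'."""
    local_snr_db = 20 * np.log10((clean_mag + 1e-12) / (noise_mag + 1e-12))
    return (local_snr_db > lc_db).astype(float)

# tiny 2x2 time-frequency example: rows = frames, columns = bins
clean_mag = np.array([[1.0, 0.1], [0.5, 2.0]])
noise_mag = np.array([[0.5, 0.5], [1.0, 0.5]])
mask = ideal_binary_mask(clean_mag, noise_mag)
# the mask keeps exactly the cells where the clean speech dominates
```

Applying the mask to the noisy spectrogram and resynthesising yields the intelligibility gains the book reports under ideal conditions; the practical challenge it then discusses is estimating such a mask without oracle knowledge.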
- Book Chapter
- 10.1007/978-3-642-38658-9_45
- Jan 1, 2013
This paper centers on a novel approach to speech enhancement in hearing aids. It consists in creating, by making use of perceptual concepts and a supervised learning process driven by a genetic algorithm (GA), a gain function (\(\mathcal{G}\)) that enhances not only the speech quality but also the speech intelligibility in noisy environments. The proposed algorithm creates the enhanced gain function by using a Gaussian mixture model fueled by the GA. The extent to which the speech quality is enhanced is quantitatively measured by the algorithm itself, using a scheme based on the perceptual evaluation of speech quality (PESQ) standard. In this “blind” process, it uses no initial information other than that iteratively quantified by the PESQ measurement. The GA computes the optimized parameters that maximize the PESQ score. The experimental work, carried out over three different databases, shows how the computed gain function assists the hearing aid in enhancing speech, when compared to the values reached by a standard hearing aid based on a multiband compressor-expander algorithm.
Keywords: Gaussian mixture model, genetic algorithms, perceptual evaluation of speech quality, speech enhancement, digital hearing aids
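The GA-driven optimization loop can be sketched generically. Since PESQ itself is not available here, the fitness below is a stand-in (closeness of a two-parameter gain curve to a known target curve); in the paper the fitness is the actual PESQ score of the enhanced speech and the parameters describe the GMM-based gain function. Population size, mutation scale, and the toy target are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy fitness standing in for the PESQ score: negative MSE between a
# 2-parameter linear gain curve and an "ideal" target curve.
levels = np.linspace(0, 1, 50)
target = 0.8 * levels + 0.1               # illustrative target gain curve

def fitness(params):
    a, b = params
    return -np.mean((a * levels + b - target) ** 2)

def ga_maximize(fitness, n_pop=30, n_gen=60, sigma=0.1):
    """Elitist GA: keep the best half, refill with mutated copies."""
    pop = rng.uniform(-1, 1, size=(n_pop, 2))
    for _ in range(n_gen):
        scores = np.array([fitness(p) for p in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[:n_pop // 2]]                 # truncation selection
        children = parents[rng.integers(0, len(parents), n_pop - len(parents))]
        children = children + sigma * rng.standard_normal(children.shape)
        pop = np.vstack([parents, children])              # elitism + mutation
    scores = np.array([fitness(p) for p in pop])
    return pop[int(np.argmax(scores))]

best = ga_maximize(fitness)
```

In the paper's "blind" setup, evaluating `fitness` means enhancing speech with the candidate gain function and scoring it with PESQ, so each generation requires a batch of PESQ measurements.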
- Book Chapter
3
- 10.1016/b978-0-12-823898-1.00004-7
- Jan 1, 2021
- Applied Speech Processing
Chapter 3 - Modified least mean square adaptive filter for speech enhancement
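The chapter's specific LMS modification is not described in this listing, but the baseline it builds on, a normalised LMS (NLMS) noise canceller with a reference microphone, can be sketched as follows. Filter order, step size, and the toy noise path are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def nlms(primary, reference, order=8, mu=0.1, eps=1e-8):
    """Standard NLMS noise canceller: adapt an FIR filter so that the
    filtered reference cancels the noise in the primary signal; the
    error signal is the enhanced output."""
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]   # ref[n], ref[n-1], ...
        y = w @ x                                  # noise estimate
        e = primary[n] - y                         # enhanced sample
        w += mu * e * x / (x @ x + eps)            # normalised update
        out[n] = e
    return out

n = 4000
speech = np.sin(2 * np.pi * 0.03 * np.arange(n))          # stand-in "speech"
ref = rng.standard_normal(n)                              # noise reference
noise = np.convolve(ref, [0.6, -0.3, 0.1])[:n]            # causal noise path
noisy = speech + noise
enhanced = nlms(noisy, ref)

# compare error power after the filter has converged
err_before = np.mean((noisy[1000:] - speech[1000:]) ** 2)
err_after = np.mean((enhanced[1000:] - speech[1000:]) ** 2)
```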
- Research Article
39
- 10.1109/access.2020.3021061
- Jan 1, 2020
- IEEE Access
Human speech in real-world environments is typically degraded by background noise, which has a negative impact on perceptual speech quality and intelligibility and causes performance degradation in various speech-related technological applications, such as hearing aids and automatic speech recognition systems. Noise also degrades the original phase of the clean speech and introduces perceptual disturbance, further reducing speech quality. Therefore, speech enhancement must be dealt with vigilantly in everyday listening environments. In this article, speech enhancement is performed using supervised learning of spectral masking. Deep neural networks (DNN) and recurrent neural networks (RNN) are trained to learn the spectral masking from the magnitude spectrograms of the degraded speech. An iterative procedure is adopted as a post-processing step to deal with the noisy phase. Additionally, an intelligibility improvement filter is used to incorporate the critical band importance function weights, where higher weights contribute more towards intelligibility. Systematic experiments demonstrated that the proposed approaches greatly attenuated the background noise and led to large improvements in perceived speech quality and intelligibility, as well as automatic speech recognition. In the experiments, the TIMIT database is used. The STOI is improved by 17.6% over the noisy speech, and SDR and PESQ are improved by 5.22 dB and 19% over the noisy speech utterances. These comparisons showed that the proposed speech enhancement approaches outperformed related speech enhancement methods.
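Two pieces of the pipeline above can be sketched: the oracle form of the spectral mask such networks are trained to predict, and an iterative phase refinement in the spirit of the post-processing step (the paper's exact procedure may differ; the sketch uses a Griffin-Lim-style iteration, and all sizes are illustrative).

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    idx = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(win * x[i:i + n_fft]) for i in idx])

def istft(S, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    x = np.zeros((len(S) - 1) * hop + n_fft)
    norm = np.zeros_like(x)
    for i, spec in enumerate(S):
        x[i * hop:i * hop + n_fft] += win * np.fft.irfft(spec, n_fft)
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-6)       # windowed overlap-add

def ideal_ratio_mask(clean_mag, noise_mag):
    # oracle form of the mask a DNN/RNN would be trained to predict
    return clean_mag ** 2 / (clean_mag ** 2 + noise_mag ** 2 + 1e-12)

def refine_phase(mag, n_iter=30):
    # Griffin-Lim-style iteration: alternately enforce the target
    # magnitude and project onto consistent (resynthesised) spectrograms
    S = mag.astype(complex)                 # zero initial phase
    for _ in range(n_iter):
        S = mag * np.exp(1j * np.angle(stft(istft(S))))
    return istft(S)

x = np.sin(2 * np.pi * 0.05 * np.arange(2048))   # stand-in signal
mag = np.abs(stft(x))                            # target (masked) magnitude
recon = refine_phase(mag)
```

In the full system, `mag` would be the noisy magnitude scaled by the network's estimated mask rather than an oracle spectrogram.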
- Research Article
25
- 10.1016/j.specom.2008.05.004
- May 20, 2008
- Speech Communication
Combined speech enhancement and auditory modelling for robust distributed speech recognition
- Research Article
- 10.1038/s41598-025-05057-2
- Jul 2, 2025
- Scientific Reports
Speech enhancement (SE) and automatic speech recognition (ASR) in real-time processing involve improving the quality and intelligibility of speech signals on the fly, ensuring accurate transcription as the speech unfolds. SE eliminates unwanted background noise from target speech in environments with high background noise levels, which is crucial in real-time ASR. This study first proposes a speech enhancement network based on an attentional-codec model. Its primary objective is to suppress noise in the target speech with minimal distortion. However, excessive noise suppression in the enhanced speech can potentially diminish the effectiveness of downstream ASR systems by excluding crucial latent information. While joint SE and ASR techniques have shown promise for achieving robust end-to-end ASR, they traditionally rely on using the enhanced features as inputs to the ASR systems. To address this limitation, our study uses a dynamic fusion approach. This approach integrates both the enhanced features and the raw noisy features, aiming to eliminate noise signals from the enhanced target speech while simultaneously learning fine details from the noisy signals. This fusion approach seeks to mitigate speech distortions, enhancing the overall performance of the ASR system. The proposed model consists of an attentional codec equipped with a causal attention mechanism for SE, a GRU-based fusion network, and an ASR system. The SE network uses a modified Gated Recurrent Unit (GRU), where the traditional hyperbolic tangent (tanh) is replaced by an attention-based rectified linear unit (AReLU). In SE experiments, the proposed model consistently obtains better speech quality, intelligibility, and noise suppression than the baselines in both matched and unmatched conditions. With the LibriSpeech database, the proposed SE obtains STOI and PESQ improvements of 19.81% and 28.97% in matched conditions, and of 17.27% and 27.51% in unmatched conditions.
The joint training framework for robust end-to-end ASR is evaluated using the character error rate (CER). The ASR results show that the joint training framework reduces the error rate from 32.99% (averaged over noisy signals) to 13.52% (with the proposed SE and joint training for ASR).
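The modified recurrent cell described above (tanh replaced by an attention-based rectified linear unit) can be sketched as follows. The AReLU form used, scaling negative pre-activations by a clamped α and positive ones by 1 + sigmoid(β), follows the published AReLU activation; the fixed α and β values and tiny dimensions are illustrative, and the paper's full model (causal attention codec, fusion network) is not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def arelu(x, alpha=0.9, beta=2.0):
    """Attention-based ReLU: amplify positive pre-activations by
    1 + sigmoid(beta), scale negative ones by a clamped alpha
    (both learnable in the original formulation; fixed here)."""
    a = np.clip(alpha, 0.01, 0.99)
    return np.where(x >= 0, (1 + sigmoid(beta)) * x, a * x)

def gru_step(x, h, P):
    """One step of the modified GRU: the candidate activation uses
    AReLU where a standard GRU would use tanh."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["bz"])   # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])   # reset gate
    h_tilde = arelu(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(4)
d_in, d_h = 16, 8
P = {k: rng.standard_normal((d_h, d_in if k[0] == "W" else d_h)) * 0.1
     for k in ["Wz", "Wr", "Wh", "Uz", "Ur", "Uh"]}
P.update({k: np.zeros(d_h) for k in ["bz", "br", "bh"]})

h = np.zeros(d_h)
for t in range(20):                      # run the cell over a random sequence
    h = gru_step(rng.standard_normal(d_in), h, P)
```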
- Research Article
16
- 10.1016/j.compeleceng.2020.106657
- May 27, 2020
- Computers & Electrical Engineering
Speech enhancement - an enhanced principal component analysis (EPCA) filter approach
- Research Article
9
- 10.1016/j.apacoust.2004.02.004
- Apr 12, 2004
- Applied Acoustics
Speech enhancement based on the discrete Gabor transform and multi-notch adaptive digital filters
- Research Article
57
- 10.1109/access.2019.2922370
- Jan 1, 2019
- IEEE Access
This paper presents a Speech Enhancement (SE) technique based on a multi-objective learning convolutional neural network to improve the overall quality of speech perceived by Hearing Aid (HA) users. The proposed method is implemented on a smartphone as an application that performs real-time SE. This arrangement works as an assistive tool to the HA. A multi-objective learning architecture including primary and secondary features uses a mapping-based convolutional neural network (CNN) model to remove noise from a noisy speech spectrum. The algorithm is computationally fast and has a low processing delay, which enables it to operate seamlessly on a smartphone. The steps and the detailed analysis of the real-time implementation are discussed. The proposed method is compared with existing conventional and neural network-based SE techniques through speech quality and intelligibility metrics in various noisy speech conditions. The key contribution of this paper is the realization of the CNN SE model on a smartphone processor that works seamlessly with the HA. The experimental results demonstrate significant improvements over the state-of-the-art techniques and reflect the usability of the developed SE application in noisy environments.
- Research Article
10
- 10.1109/hic.2017.8227577
- Nov 1, 2017
- ... Health innovations and point-of-care technologies conference
In this paper, we present a Speech Enhancement (SE) method implemented on a smartphone, and this arrangement functions as an assistive device to hearing aids (HA). Many benchmark single-channel SE algorithms implemented on HAs provide considerable improvement in speech quality, while speech intelligibility improvement still remains a prime challenge. The proposed SE method, based on the log spectral amplitude estimator, improves speech intelligibility in noisy real-world acoustic environments using a priori information about formant frequency locations. The formant frequency information allows us to control the amount of speech distortion in these frequency bands. We introduce a 'scaling' parameter for the SE gain function, which controls the gains over the non-formant frequency bands, allowing HA users to customize the playback speech to their listening preference using a smartphone application. Objective intelligibility measures show the effectiveness of the proposed SE method. Subjective results reflect the suitability of the developed Speech Enhancement application in real-world noisy conditions at SNR levels of -5 dB, 0 dB, and 5 dB.
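The scaling idea can be sketched with a simple gain rule: compute a suppression gain per frequency band, then reduce it further only outside the formant bands, leaving formant regions untouched to limit speech distortion. The Wiener-type gain below is a stand-in for the paper's log-spectral-amplitude estimator, and the band values are illustrative.

```python
import numpy as np

def scaled_gain(xi, formant_mask, scale=0.5):
    """Suppression gain with a 'scaling' parameter applied only to
    non-formant bands. `xi` is the a priori SNR per band; a Wiener
    gain stands in for the paper's log-spectral-amplitude estimator."""
    g = xi / (1.0 + xi)                   # a priori SNR -> Wiener gain
    return np.where(formant_mask, g, np.clip(scale * g, 0.0, 1.0))

xi = np.array([9.0, 9.0, 0.1, 0.1])       # a priori SNR per band
formant = np.array([True, False, True, False])
g = scaled_gain(xi, formant, scale=0.5)
# bands 0 and 2 (formant) keep the full gain; bands 1 and 3 are scaled down
```

In the application described, `scale` is the user-adjustable parameter exposed on the smartphone, trading off noise suppression against distortion in the non-formant bands.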
- Dissertation
- 10.25904/1912/4020
- Dec 4, 2020
Speech corrupted by background noise (or noisy speech) can cause misinterpretation and fatigue during phone and conference calls, and for hearing aid users. Noisy speech can also severely impact the performance of speech processing systems such as automatic speech recognition (ASR), automatic speaker verification (ASV), and automatic speaker identification (ASI) systems. Currently, deep learning approaches are employed in an end-to-end fashion to improve robustness. The target speech (or clean speech) is used as the training target or large noisy speech datasets are used to facilitate multi-condition training. In this dissertation, we propose competitive alternatives to the preceding approaches by updating two classic robust speech processing techniques using deep learning. The two techniques include minimum mean-square error (MMSE) and missing data approaches. An MMSE estimator aims to improve the perceived quality and intelligibility of noisy speech. This is accomplished by suppressing any background noise without distorting the speech. Prior to the introduction of deep learning, MMSE estimators were the standard speech enhancement approach. MMSE estimators require the accurate estimation of the a priori signal-to-noise ratio (SNR) to attain a high level of speech enhancement performance. However, current methods produce a priori SNR estimates with a large tracking delay and a considerable amount of bias. Hence, we propose a deep learning approach to a priori SNR estimation that is significantly more accurate than previous estimators, called Deep Xi. Through objective and subjective testing across multiple conditions, such as real-world non-stationary and coloured noise sources at multiple SNR levels, we show that Deep Xi allows MMSE estimators to produce the highest quality enhanced speech amongst all clean speech magnitude spectrum estimators. 
Missing data approaches improve robustness by performing inference only on noisy speech features that reliably represent clean speech. In particular, the marginalisation method was able to significantly increase the robustness of Gaussian mixture model (GMM)-based speech classification systems (e.g. GMM-based ASR, ASV, or ASI systems) in the early 2000s. However, deep neural networks (DNNs) used in current speech classification systems are non-probabilistic, a requirement for marginalisation. Hence, multi-condition training or noisy speech pre-processing is used to increase the robustness of DNN-based speech classification systems. Recently, sum-product networks (SPNs) were proposed, which are deep probabilistic graphical models that can perform the probabilistic queries required for missing data approaches. While available toolkits for SPNs are in their infancy, we show through an ASI task that SPNs using missing data approaches could be a strong alternative for robust speech processing in the future. This dissertation demonstrates that MMSE estimators and missing data approaches are still relevant approaches to robust speech processing when assisted by deep learning.
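The a priori SNR estimator whose tracking delay motivates Deep Xi is the classic decision-directed approach. A minimal sketch (using a Wiener gain for the previous-frame amplitude estimate, with toy single-bin inputs) makes the delayed response at a sudden speech onset visible:

```python
import numpy as np

def decision_directed_xi(noisy_power, noise_power, alpha=0.98):
    """Decision-directed a priori SNR estimate (Ephraim-Malah style):
    a weighted sum of the previous frame's enhanced power over the
    noise power and the current instantaneous SNR estimate."""
    n_frames, n_bins = noisy_power.shape
    xi = np.zeros((n_frames, n_bins))
    prev_s2 = np.zeros(n_bins)                 # previous enhanced |S|^2
    for t in range(n_frames):
        gamma = noisy_power[t] / noise_power   # a posteriori SNR
        xi[t] = (alpha * prev_s2 / noise_power
                 + (1 - alpha) * np.maximum(gamma - 1.0, 0.0))
        gain = xi[t] / (1.0 + xi[t])           # Wiener gain stand-in
        prev_s2 = (gain ** 2) * noisy_power[t]
    return xi

# toy example: noise-only frames followed by a sudden speech onset
noise_power = np.array([1.0])
noisy = np.vstack([np.full((5, 1), 1.0),       # noise only (gamma ~ 1)
                   np.full((5, 1), 10.0)])     # onset (true xi ~ 9)
xi = decision_directed_xi(noisy, noise_power)
```

At the onset frame the estimate is far below the true a priori SNR of about 9 and only climbs toward it over subsequent frames: exactly the tracking delay and bias the dissertation's Deep Xi estimator is designed to remove.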