Articles published on Speech Enhancement
- Research Article
- 10.1016/j.neunet.2025.107805
- Nov 1, 2025
- Neural networks : the official journal of the International Neural Network Society
- Xiaodong Zhu + 4 more
Lightweight real-time speech enhancement: State-space models and multi-spectral scanning techniques.
- Research Article
- 10.1016/j.ijmedinf.2025.106029
- Nov 1, 2025
- International journal of medical informatics
- Chen Chen + 8 more
Deep learning-based in-ambulance speech recognition and generation of prehospital emergency diagnostic summaries using LLMs.
- Research Article
- 10.1016/j.specom.2025.103314
- Nov 1, 2025
- Speech Communication
- Junyu Wang + 5 more
LORT: Locally refined convolution and Taylor transformer for monaural speech enhancement
- Research Article
- 10.1016/j.apacoust.2025.110844
- Nov 1, 2025
- Applied Acoustics
- Tianchi Sun + 5 more
Exploiting lightweight neural post-filtering for directional speech enhancement
- Research Article
- 10.3390/math13213481
- Oct 31, 2025
- Mathematics
- Tsung-Jung Li + 1 more
We propose MUSE++, an advanced and lightweight speech enhancement (SE) framework that builds upon the original MUSE architecture by introducing three key improvements: a Mamba-based state space model, dynamic SNR-driven data augmentation, and an augmented multi-objective loss function. First, we replace the original multi-path enhanced Taylor (MET) transformer block with the Mamba architecture, enabling substantial reductions in model complexity and parameter count while maintaining robust enhancement capability. Second, we adopt a dynamic training strategy that varies the signal-to-noise ratios (SNRs) across diverse speech samples, promoting improved generalization to real-world acoustic scenarios. Third, we expand the model’s loss framework with additional objective measures, allowing the model to be empirically tuned towards both perceptual and objective SE metrics. Comprehensive experiments conducted on the VoiceBank-DEMAND dataset demonstrate that MUSE++ delivers consistently superior performance across standard evaluation metrics, including PESQ, CSIG, CBAK, COVL, SSNR, and STOI, while reducing the number of model parameters by over 65% compared to the baseline. These results highlight MUSE++ as a highly efficient and effective solution for speech enhancement, particularly in resource-constrained and real-time deployment scenarios.
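The dynamic SNR-driven augmentation described above amounts to remixing each training pair at a freshly drawn signal-to-noise ratio. A minimal Python sketch of that general recipe, where the SNR range and mixing rule are illustrative assumptions rather than the authors' exact settings:

```python
# Sketch of dynamic SNR-driven augmentation (illustrative, not the MUSE++ code):
# scale the noise so each training mixture hits a randomly drawn SNR.
import numpy as np

def mix_at_random_snr(clean, noise, snr_range_db=(-5.0, 15.0), rng=None):
    """Return (mixture, snr_db) with the noise rescaled to the sampled SNR."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_range_db)      # a fresh SNR for every sample
    noise = noise[: len(clean)]              # align lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12    # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise, snr_db
```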
- Research Article
- 10.3397/in_2025_1074529
- Oct 22, 2025
- INTER-NOISE and NOISE-CON Congress and Conference Proceedings
- Damjan Pecioski + 5 more
In recent years, visualizing an audio source has become a widely used tool in many fields, such as speech recognition and enhancement, and especially in noise localization in different environments. Acoustic cameras are widely used for noise source investigation; however, an acoustic image is difficult to obtain in environments with a large amount of noise and reverberation. An effective approach to overcome this is to couple beamforming theory with a microphone array in order to obtain a recording of the desired acoustic signal. The aim of this work is to design and develop a low-cost microphone array, based on digital MEMS microphones and a Raspberry Pi, which can be upgraded with a camera to form an acoustic camera. Different formations of the microphone array are developed, tested, and compared. The details of hardware integration as well as the development environment are discussed in this paper.
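The beamforming the authors couple with their MEMS array is, in its simplest form, delay-and-sum: align each microphone's signal for a chosen look direction, then average. A frequency-domain sketch under standard far-field assumptions (not the authors' implementation):

```python
# Delay-and-sum beamformer sketch for a microphone array (far-field model).
import numpy as np

def delay_and_sum(frames, mic_positions, direction, fs, c=343.0):
    """frames: (num_mics, num_samples); mic_positions: (num_mics, 3) in metres;
    direction: unit vector pointing toward the source."""
    num_mics, n = frames.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    delays = mic_positions @ direction / c          # per-mic delay vs. array origin
    phases = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(np.mean(spectra * phases, axis=0), n=n)
```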
- Research Article
- 10.3390/sym17101768
- Oct 20, 2025
- Symmetry
- Yang Xian + 4 more
Deep neural network-based approaches have made remarkable progress in monaural speech enhancement. Nevertheless, current cutting-edge approaches remain vulnerable to complex acoustic scenarios. We propose a Symmetric Combined Convolution Network with ConvLSTM (SCCN) for monaural speech enhancement. Specifically, the Combined Convolution Block utilizes parallel convolution branches, including standard convolution and two different depthwise separable convolutions, to reinforce feature extraction along the depthwise and channelwise dimensions. Similarly, Combined Deconvolution Blocks are stacked to construct the convolutional decoder. Moreover, we introduce exponentially increasing dilation between convolutional kernel elements in the encoder and decoder, which expands the receptive field. Meanwhile, grouped ConvLSTM layers are exploited to extract the interdependency of spatial and temporal information. The experimental results demonstrate that the proposed SCCN method obtains, on average, 86.00% in STOI and 2.43 in PESQ, outperforming state-of-the-art baseline methods and confirming its effectiveness in enhancing speech quality.
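The combined-convolution idea, parallel standard and depthwise separable branches fused back into one feature map, can be sketched in a few lines of PyTorch. This is an assumption-based reading of the block description above, not the released SCCN code:

```python
# Sketch of a combined convolution block: a standard 3x3 branch in parallel
# with a depthwise separable branch, fused by a 1x1 convolution.
import torch
import torch.nn as nn

class CombinedConvBlock(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        pad = dilation  # keeps spatial size for 3x3 kernels at any dilation
        self.standard = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.separable = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation,
                      groups=channels),             # depthwise
            nn.Conv2d(channels, channels, 1),       # pointwise
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.standard(x), self.separable(x)], dim=1))
```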
- Research Article
- 10.1044/2025_aja-25-00059
- Oct 13, 2025
- American journal of audiology
- Hsuan Yun Huang + 4 more
This study investigated the effects of a soft speech enhancement algorithm on distant speech perception for adults with severe-to-profound hearing loss (SPHL), examining speech intelligibility, listening effort, and sound quality. Participants were 16 Mandarin-speaking adults (13 men, 3 women; mean age = 58 years) with symmetrical severe-to-profound sensorineural hearing loss. They had at least 1 year of hearing aid experience. A within-subject experimental design compared two hearing aid conditions: with the Speech Enhancer algorithm activated and deactivated. Speech intelligibility was assessed using the Mandarin Chinese matrix sentence test at individual speech reception thresholds. Subjective listening effort was measured using a categorical rating scale for speech presented at three distances (2, 4, and 8 m). Sound quality ratings were collected for loudness, speech understanding, and overall impression using a visual analog scale. Activation of the speech enhancement algorithm led to a notable increase in speech intelligibility from 45% to 67%. Subjective listening effort decreased significantly with the algorithm activated at all distances, with greater benefits observed at farther distances. Similarly, sound quality ratings were significantly higher with the algorithm on for all attributes across all distances, with the largest improvements in overall impression ratings at greater distances. The soft speech enhancement algorithm significantly improved speech intelligibility, reduced listening effort, and enhanced sound quality for distant speech perception in Mandarin-speaking adults with SPHL. These findings suggest that targeted signal processing for soft speech can provide substantial benefits for individuals with SPHL, including speakers of tonal languages, potentially improving communication in challenging listening situations.
- Research Article
- 10.48084/etasr.11071
- Oct 6, 2025
- Engineering, Technology & Applied Science Research
- C Shraddha + 2 more
Minimizing noise in speech signals is crucial for applications such as speech recognition and enhancement. This paper proposes a hybrid technique that combines a biorthogonal wavelet packet with a Recursive Least Squares (RLS) adaptive filter to reduce environmental and colored noise during the preprocessing stage. Simulation results demonstrate a 2–10% improvement in speech signal strength under noisy conditions. The biorthogonal wavelet's vanishing moments and the length of the RLS filter play key roles in preserving speech characteristics while suppressing noise. Performance is evaluated using Signal-to-Noise Ratio (SNR), Mean Squared Error (MSE), and Peak Signal-to-Noise Ratio (PSNR) metrics, showing effective reduction of pink and babble noise across varying decibel levels, thereby enhancing the clarity of speech for recognition applications.
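The RLS half of the hybrid is the textbook recursive least-squares update. A sketch of the standard algorithm (the filter order, forgetting factor, and initialization below are illustrative defaults, not the paper's settings):

```python
# Textbook RLS adaptive noise canceller: estimate the noise in `noisy` from a
# correlated `reference` channel; the a-priori error is the enhanced speech.
import numpy as np

def rls_denoise(noisy, reference, order=16, lam=0.99, delta=1e2):
    w = np.zeros(order)                    # adaptive filter taps
    P = np.eye(order) * delta              # inverse-correlation estimate
    out = np.zeros(len(noisy))
    for n in range(order, len(noisy)):
        u = reference[n - order:n][::-1]   # regressor, most recent sample first
        k = P @ u / (lam + u @ P @ u)      # gain vector
        e = noisy[n] - w @ u               # a-priori error = speech estimate
        w = w + k * e
        P = (P - np.outer(k, u @ P)) / lam
        out[n] = e
    return out
```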
- Research Article
- 10.17485/ijst/v18i35.2678
- Oct 3, 2025
- Indian Journal Of Science And Technology
- Debabrata Gogoi + 1 more
Objectives: This work aims to enhance foreground speech by effectively removing unwanted background noise and recovering the desired signal, utilizing deep learning approaches with limited training data. Methods: This study addresses the above issue using a transfer learning-based technique that uses mel-spectrograms. Specifically, it proposes a transfer learning approach that builds on a pre-trained residual network (based on the wav2vec2 model) that includes a statistics pooling layer as used in speaker recognition. The model is then trained using a limited amount of clean and noisy data. In addition, we adopt a log mel-spectrogram feature extraction technique to improve the generalization of speech enhancement models. The data used here are drawn from the Noisy Speech Database curated by Cassia Valentini-Botinhao (2017) and the LibriSpeech corpus. Findings: Using the same dataset, the proposed model was compared with two baselines, an autoencoder and a multilayer autoencoder. The proposed approach, with an STOI score of 0.88 and an SNR improvement of 3.27 dB, outperforms both baseline models in subjective and objective evaluation. Novelty: This work eliminates signal truncation, a constraint observed in conventional speech enhancement pipelines, by integrating a statistics pooling layer with a pre-trained wav2vec2-based residual network for variable-length input handling. Furthermore, the model's robustness and flexibility are enhanced by the use of log mel-spectrograms in this context, allowing it to produce state-of-the-art results even with sparse supervised training data. Keywords: Denoise, Mel-spectrogram, Signal Processing, Transfer Learning, Wav2Vec2
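The log mel-spectrogram front end mentioned here is a standard recipe; a sketch with librosa follows, where the frame sizes and mel count are common defaults rather than the paper's exact configuration:

```python
# Log mel-spectrogram feature extraction (standard recipe, assumed parameters).
import numpy as np
import librosa

def log_mel(wav, sr=16000, n_fft=512, hop_length=128, n_mels=80):
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(mel + 1e-6)  # log compression stabilizes the dynamic range
```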
- Research Article
- 10.17743/jaes.2022.0222
- Oct 3, 2025
- Journal of the Audio Engineering Society
- Xiaojing Liu + 2 more
The simultaneous presence of multiple audio signals can lead to information loss due to auditory masking and interference, often resulting in diminished signal clarity. The authors propose a speech enhancement system designed to present multiple tracks of speech information with reduced auditory masking, thereby enabling more effective discernment of multiple simultaneous talkers. The system evaluates auditory masking using the ITU-R BS.1387 Perceptual Evaluation of Audio Quality model along with ideal mask ratio metrics. To achieve optimal results, a combined iterative Harmony Search algorithm and integer optimization are employed, applying audio effects such as level balancing, equalization, dynamic range compression, and spatialization, aimed at minimizing masking. Objective and subjective listening tests demonstrate that the proposed system performs competitively against mixes created by professional sound engineers and surpasses existing automixing systems. This system is applicable in various communication scenarios, including teleconferencing, in-game voice communication, and live streaming.
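Harmony Search itself is a simple population-based loop over candidate effect-parameter vectors. A generic toy sketch follows; the system's actual cost wraps the PEAQ and mask-ratio models, which are not reproduced here:

```python
# Generic Harmony Search minimizer over box-bounded parameters (toy sketch).
import numpy as np

def harmony_search(cost, bounds, hms=20, hmcr=0.9, par=0.3, iters=500, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    memory = rng.uniform(lo, hi, size=(hms, dim))       # harmony memory
    scores = np.array([cost(h) for h in memory])
    for _ in range(iters):
        use_mem = rng.random(dim) < hmcr                # memory consideration
        picked = memory[rng.integers(hms, size=dim), np.arange(dim)]
        new = np.where(use_mem, picked, rng.uniform(lo, hi))
        adjust = use_mem & (rng.random(dim) < par)      # pitch adjustment
        new = np.where(adjust, new + 0.05 * (hi - lo) * rng.standard_normal(dim), new)
        new = np.clip(new, lo, hi)
        worst = int(np.argmax(scores))
        s = cost(new)
        if s < scores[worst]:                           # replace the worst harmony
            memory[worst], scores[worst] = new, s
    return memory[int(np.argmin(scores))]
```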
- Research Article
- 10.1121/10.0039557
- Oct 1, 2025
- The Journal of the Acoustical Society of America
- Zhong-Qiu Wang
The current dominant approach for neural speech enhancement is purely supervised deep learning on simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. The trained models, however, often exhibit limited generalizability to real-recorded mixtures. To deal with this, this paper investigates training enhancement models directly on real mixtures. A major difficulty with this approach is that, since the clean speech of real mixtures is unavailable, good supervision for real mixtures is lacking. In this context, assuming that a training set consisting of real-recorded pairs of close-talk and far-field mixtures is available, we propose to address this difficulty via close-talk speech enhancement: an enhancement model is first trained on simulated mixtures to enhance real-recorded close-talk mixtures, and the estimated close-talk speech can then be utilized as supervision (i.e., as pseudo-labels) for training far-field speech enhancement models directly on the paired real-recorded far-field mixtures. We name the proposed system ctPuLSE. Evaluation results on the popular CHiME-4 dataset show that ctPuLSE can derive high-quality pseudo-labels and yield far-field speech enhancement models with strong generalizability to real data.
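The recipe reduces to: enhance the close-talk channel with a simulation-trained model, then treat that output as the target for the far-field model. A conceptual sketch, with placeholder models and an L1 loss chosen for illustration rather than taken from the paper:

```python
# Conceptual ctPuLSE-style training loop: pseudo-labels come from enhancing
# the paired close-talk recording (placeholder models, illustrative loss).
import torch
import torch.nn.functional as F

def train_far_field(far_model, close_enhancer, pairs, optimizer):
    """`pairs` yields (close_talk_mixture, far_field_mixture) real recordings."""
    for close_mix, far_mix in pairs:
        with torch.no_grad():
            pseudo_label = close_enhancer(close_mix)  # estimated close-talk speech
        estimate = far_model(far_mix)
        loss = F.l1_loss(estimate, pseudo_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```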
- Research Article
- 10.1016/j.neucom.2025.130798
- Oct 1, 2025
- Neurocomputing
- Khanh Nguyen + 1 more
Cross-architecture knowledge distillation for speech enhancement: From CMGAN to Unet
- Research Article
- 10.1049/icp.2025.2881
- Oct 1, 2025
- IET Conference Proceedings
- Jinhui Wang + 2 more
CrossModal-DEAF: a cross-modal adaptive fusion network for robust speech enhancement
- Research Article
- 10.1016/j.inffus.2025.103218
- Oct 1, 2025
- Information Fusion
- Hang Chen + 7 more
Cross-attention among spectrum, waveform and SSL representations with bidirectional knowledge distillation for speech enhancement
- Research Article
- 10.1080/17483107.2025.2559188
- Sep 26, 2025
- Disability and Rehabilitation: Assistive Technology
- Odile C Van Stuijvenberg + 5 more
Neural implants are being developed to treat various conditions, including sensory impairments such as blindness and deafness. In these technologies there is a growing role for artificial intelligence (AI) in enabling interpretation of complex data input. Current users of cochlear implants (CIs) face challenges in noisy environments, prompting the development of AI-driven software for personalized and context-aware noise suppression and speech enhancement. For blindness, an AI-driven cortical visual neural implant (cVNI) for artificial visual perception is under development. Here, AI-driven software may be used to process camera imaging for interfacing with the brain. If successful, these devices can offer important advantages for their users, yet may also have ethical implications. The perspectives of (potential) users of these technologies are an important source for ethical analysis, yet so far they have not been explored in depth. We performed a focus-group and interview study including potential users of a) the AI-driven cVNI (n = 5) and b) the AI-driven CI (n = 3), and c) current or former users of a retinal implant (n = 3). Focus groups and interviews were transcribed and analyzed thematically. Perspectives were clustered under 1) expectations and experiences, including improvements from the status quo, enhancement of autonomy, and design requirements, and 2) perceived risks and anticipated disadvantages, including uncertainty about effectiveness, operational risks, surgical risks, and media attention. AI-driven neural implants for vision and hearing were positively received by potential users due to their potential to improve autonomy. Yet, possible conditions for uptake were identified, including device aesthetics and sufficient levels of user control.
- Research Article
- 10.1145/3749463
- Sep 3, 2025
- Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
- Lei Wang + 7 more
Speech enhancement can greatly improve the user experience during phone calls in low signal-to-noise ratio (SNR) scenarios. In this paper, we propose a low-cost, energy-efficient, and environment-independent speech enhancement system, namely AccCall, that improves phone call quality using the smartphone's built-in accelerometer. However, a significant gap remains between the underlying insight and its practical application, as several critical challenges must be addressed, including efficient speech enhancement in cross-user scenarios, adaptive system triggering to reduce energy consumption, and lightweight deployment for real-time processing. To this end, we first design the Acc-Aided Network (AccNet), a cross-modal deep learning model inherently capable of cross-user generalization through three key components: a cross-modal fusion module, an accelerometer-aided (acc-aided) mask generator, and a unified loss function. Second, we adopt a machine learning-based approach instead of deep learning to achieve high accuracy in distinguishing call activity states for adaptive system triggering, ensuring lower energy consumption and efficient deployment on mobile platforms. Finally, we propose a knowledge-distillation-driven structured pruning framework that optimizes model efficiency while preserving performance. Extensive experiments with 20 participants were conducted under a user-independent scenario. The results show that AccCall achieves excellent and reliable adaptive triggering performance and enables substantial real-time improvements in SISDR, SISNR, STOI, PESQ, and WER, demonstrating the superiority of our system in enhancing speech quality and intelligibility for phone calls.
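At inference time, the acc-aided mask generator amounts to predicting a time-frequency mask from fused audio and accelerometer features and applying it to the noisy spectrogram. An illustrative sketch with placeholder modules, since AccNet itself is not public:

```python
# Illustrative accelerometer-aided masking step (placeholder fusion/mask nets).
import torch

def enhance(noisy_stft, acc_features, fusion_net, mask_net):
    fused = fusion_net(noisy_stft.abs(), acc_features)  # cross-modal fusion
    mask = torch.sigmoid(mask_net(fused))               # acc-aided mask in [0, 1]
    return mask * noisy_stft                            # masked complex STFT
```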
- Research Article
- 10.3390/s25175484
- Sep 3, 2025
- Sensors (Basel, Switzerland)
- Alon Nemirovsky + 2 more
This paper presents a bilinear framework for real-time speech source separation and dereverberation tailored to wearable augmented reality devices operating in dynamic acoustic environments. Using the Speech Enhancement for Augmented Reality (SPEAR) Challenge dataset, we perform extensive validation with real-world recordings and examine key algorithmic parameters, including the forgetting factor and regularization. To enhance robustness against direction-of-arrival (DOA) estimation errors caused by head movements and localization uncertainty, we propose a region-of-interest (ROI) beamformer that replaces conventional point-source steering. Additionally, we introduce a multi-constraint beamforming design capable of simultaneously preserving multiple sources or suppressing known undesired sources. Experimental results demonstrate that ROI-based steering significantly improves robustness to localization errors while maintaining effective noise and reverberation suppression, though at the cost of increased high-frequency leakage from both desired and undesired sources. The multi-constraint formulation further enhances source separation with a modest trade-off in noise reduction. The proposed integration of ROI and LCMP within a low-complexity framework, validated comprehensively on the SPEAR dataset, offers a practical and efficient solution for real-time audio enhancement in wearable augmented reality systems.
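The multi-constraint design builds on the textbook LCMP solution, which has a closed form: w = R⁻¹C (CᴴR⁻¹C)⁻¹ f. A direct numpy sketch of that standard formulation (the paper's exact variant may differ):

```python
# Closed-form LCMP beamformer weights: w = R^{-1} C (C^H R^{-1} C)^{-1} f.
import numpy as np

def lcmp_weights(R, C, f):
    """R: (M, M) noise covariance; C: (M, K) constraint steering matrix;
    f: (K,) desired responses (e.g., 1 to preserve a source, 0 to null one)."""
    Ri_C = np.linalg.solve(R, C)  # R^{-1} C without forming an explicit inverse
    return Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, f)
```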
- Research Article
- 10.1016/j.neunet.2025.107562
- Sep 1, 2025
- Neural networks : the official journal of the International Neural Network Society
- Ye-Xin Lu + 2 more
Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement.
- Research Article
- 10.1016/j.compbiomed.2025.110940
- Sep 1, 2025
- Computers in biology and medicine
- Jun Zhang + 7 more
Early stroke diagnosis and evaluation based on pathological voice classification using speech enhancement.