Speech enhancement based on perceptual loudness and statistical models of speech

Abstract

This dissertation is concerned with speech enhancement based on statistical and loudness models of speech. We study the field of speech enhancement with the objective of improving the quality of speech signals in noisy environments. First, speech enhancement based on the Laplacian model for speech signals is reviewed; its performance is shown to be limited by the accuracy of Laplacian parameter estimation in the noisy environment. A recursive version is proposed that estimates the Laplacian model parameters from the enhanced speech and then uses these estimates to re-enhance the original noisy speech. This approach achieves better parameter estimation and hence further improvement of speech quality. Next, loudness models for speech are reviewed. Since the loudness domain describes the human hearing system better than the spectrum, the fundamental approaches of spectral subtraction are extended to the loudness domain, and we propose a loudness subtraction approach. Tests are performed for subtraction with different values of the loudness-model exponent α. Simulations show that the quality of enhanced speech can be optimized by choosing the appropriate α for a given input SNR; an adaptive-α subtraction model is therefore proposed, and simulations show it can further improve the performance of spectral subtraction. The proposed loudness subtraction with fixed α is then shown to provide better results overall than classical spectral subtraction, even though noise residue and unpleasant […]
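The abstract does not give the exact loudness model, so the following is only a minimal sketch of the loudness-subtraction idea, assuming a simple power-law (Stevens-type) compression of the short-time power spectrum with exponent α; the function name, framing parameters, and spectral floor are illustrative choices, not the dissertation's method.

```python
import numpy as np

def loudness_subtraction(noisy, noise_psd, alpha=0.3, n_fft=512, hop=256):
    """Loudness-domain subtraction sketch: compress the short-time power
    spectrum with a power-law loudness mapping (exponent alpha), subtract
    the compressed noise estimate, map back, and resynthesize with the
    noisy phase. noise_psd: noise power on the rfft grid (n_fft//2+1 bins)."""
    window = np.hanning(n_fft)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - n_fft + 1, hop):
        frame = noisy[start:start + n_fft] * window
        spec = np.fft.rfft(frame)
        loud_noisy = np.abs(spec) ** (2 * alpha)   # "loudness" of noisy frame
        loud_noise = noise_psd ** alpha            # "loudness" of the noise
        # subtract in the loudness domain, with a small spectral floor
        loud_clean = np.maximum(loud_noisy - loud_noise, 0.01 * loud_noisy)
        mag_clean = loud_clean ** (1.0 / (2 * alpha))
        enhanced = mag_clean * np.exp(1j * np.angle(spec))
        out[start:start + n_fft] += np.fft.irfft(enhanced) * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

Sweeping alpha and scoring the output with an objective quality measure at each input SNR would mirror the experiments the abstract describes; the adaptive-α variant would then select α as a function of the estimated input SNR.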

References (showing 9 of 101 papers)
  • Open access
  • Cited by 1466
  • 10.1109/89.928915
Noise power spectral density estimation based on optimal smoothing and minimum statistics
  • Jul 1, 2001
  • IEEE Transactions on Speech and Audio Processing
  • R. Martin

  • Cited by 342
  • 10.1109/49.138987
An objective measure for predicting subjective quality of speech coders
  • Jun 1, 1992
  • IEEE Journal on Selected Areas in Communications
  • S. Wang + 2 more

  • Cited by 52
  • 10.1049/ip-vis:20000323
Low distortion speech enhancement
  • Jan 1, 2000
  • IEE Proceedings - Vision, Image, and Signal Processing
  • I. Y. Soon + 1 more

  • Cited by 5
  • 10.1109/icassp.1991.150503
Enhancement of noisy speech by maximum likelihood estimation
  • Jan 1, 1991
  • H. Kobatake + 2 more

  • Cited by 321
  • 10.1109/tsa.2005.851927
Speech enhancement based on minimum mean-square error estimation and supergaussian priors
  • Sep 1, 2005
  • IEEE Transactions on Speech and Audio Processing
  • R. Martin

  • Cited by 291
  • 10.1109/tassp.1977.1162974
Adaptive transform coding of speech signals
  • Aug 1, 1977
  • IEEE Transactions on Acoustics, Speech, and Signal Processing
  • R. Zelinski + 1 more

  • Cited by 226
  • 10.1109/89.482211
Reduction of broad-band noise in speech by truncated QSVD
  • Jan 1, 1995
  • IEEE Transactions on Speech and Audio Processing
  • S. H. Jensen + 3 more
  • Cited by 14
  • 10.1109/iecon.1995.483843
A spectral subtraction method for the enhancement of speech corrupted by nonwhite, nonstationary noise
  • Nov 1, 1995
  • S. M. McOlash + 2 more

  • Cited by 190
  • 10.1109/89.701361
A parametric formulation of the generalized spectral subtraction method
  • Jul 1, 1998
  • IEEE Transactions on Speech and Audio Processing
  • Boh Lim Sim + 3 more

Similar Papers
  • Conference Article
  • Cited by 9
  • 10.1109/isie.2017.8001429
Low-latency smartphone app for real-time noise reduction of noisy speech signals
  • Jun 1, 2017
  • Aditya Bhattacharya + 2 more

This paper presents a low-latency smartphone app that achieves real-time noise reduction of speech signals in noisy environments. The app overcomes two shortcomings of previously developed apps for the same purpose: high latency and musical-noise artifacts. Results of both objective and subjective evaluations are reported, showing the app's effectiveness at reducing noise for speech signals, in particular in babble background noise.

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-10-3223-3_63
Secure Speech Enhancement Using LPC Based FEM in Wiener Filter
  • Jun 1, 2017
  • Kavita Bhatt + 2 more

Speech enhancement improves the quality of speech signals in noisy environments by removing or reducing background noise. Degradation of the speech signal is a common problem in speech communication, so enhancement plays a vital role in improving signal quality. Among the many methods available, we use an LPC-based FEM in a Wiener filter. The method is compared with several other speech enhancement algorithms on the NOIZEUS speech database to assess speech quality. The experimental results show that the proposed method performs better and loses no information from the original speech signal.
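The chapter does not spell out its LPC-based construction, so the following is only a generic sketch of how an LPC (all-pole) speech model can feed a Wiener gain: Levinson-Durbin fits the all-pole model, whose spectrum serves as the speech PSD estimate in G = S/(S+N). Function names and the order/FFT parameters are assumptions, not the chapter's values.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: LPC coefficients a (a[0] = 1) and the
    prediction-error power, from autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_new = a.copy()
        a_new[1:i] += k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, err

def lpc_wiener_gain(frame, noise_psd, order=12, n_fft=512):
    """Wiener gain G = S / (S + N) with the speech PSD S taken from an
    all-pole (LPC) spectral model, S(f) = g / |A(f)|^2. In a real system
    the LPC fit would target an estimate of the clean speech; fitting the
    noisy frame here keeps the sketch short."""
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:] / len(frame)
    a, g = levinson_durbin(r[:order + 1], order)
    A = np.fft.rfft(a, n_fft)                    # A(f) on the rfft grid
    speech_psd = g / np.maximum(np.abs(A) ** 2, 1e-12)
    return speech_psd / (speech_psd + noise_psd) # noise_psd: same grid/scale
```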

  • Conference Article
  • Cited by 1
  • 10.21437/interspeech.2009-420
Joint noise reduction and dereverberation of speech using hybrid TF-GSC and adaptive MMSE estimator
  • Sep 6, 2009
  • Behdad Dashtbozorg + 1 more

This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. The method extends a hybrid noise reduction scheme, based on the combination of a Generalized Sidelobe Canceller (GSC) and a single-channel noise reduction stage, to dereverberation. We employ the Transfer Function GSC (TF-GSC), which is better suited to dereverberation; the single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator, which we also modify for the dereverberation application. Such systems estimate the amplitude of the short-time spectrum of the clean signal; the phase of the received signal is then added to the estimated amplitude to obtain the enhanced signal. Experimental results demonstrate the superiority of the proposed method for dereverberation of speech signals in noisy environments.
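The one concrete mechanism stated above is standard for spectral amplitude estimators: only the short-time magnitude is modified, and the noisy phase is reattached before the inverse transform. A minimal sketch (function name ours):

```python
import numpy as np

def apply_gain_keep_phase(noisy_spec, gain):
    """Modify only the short-time spectral amplitude and reattach the
    phase of the received (noisy/reverberant) signal, as spectral
    amplitude estimators like the AMMSE stage do."""
    return gain * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))

# usage sketch: enhanced_frame = np.fft.irfft(apply_gain_keep_phase(spec, G))
```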

  • Research Article
  • Cited by 4
  • 10.1049/iet-spr.2010.0012
Speech dereverberation in noisy environments using an adaptive minimum mean square error estimator
  • Apr 1, 2011
  • IET Signal Processing
  • H. R. Abutalebi + 1 more

The authors present a novel method for reducing the late reverberation of speech signals in noisy environments. In this method, the amplitude of the clean signal is obtained by an adaptive estimator that minimises the mean square error (MSE) under signal presence uncertainty. The spectral gain function, an adaptive variable-order minimum-MSE estimator, is obtained as a weighted geometric mean of the hypothetical gains associated with speech presence and absence. The order of the estimator is selected for each time frame and each frequency component individually; the authors propose adapting it according to the probability of speech presence, which makes the estimation more accurate. The evaluations confirm the superiority of the proposed method for dereverberation of speech signals in noisy environments.
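The abstract states the gain explicitly as a weighted geometric mean of the hypothetical gains under speech presence and absence; a direct transcription (variable names ours):

```python
import numpy as np

def composite_gain(g_present, g_absent, p_speech):
    """Weighted geometric mean of the hypothetical gains under speech
    presence (g_present) and absence (g_absent), weighted per
    time-frequency bin by the speech-presence probability p_speech."""
    return (g_present ** p_speech) * (g_absent ** (1.0 - p_speech))

# g_absent is typically a small spectral floor (e.g. 0.1, i.e. -20 dB), so
# bins with low speech-presence probability are strongly attenuated.
```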

  • Research Article
  • Cited by 63
  • 10.1109/tasl.2008.2002071
Joint Dereverberation and Residual Echo Suppression of Speech Signals in Noisy Environments
  • Nov 1, 2008
  • IEEE Transactions on Audio, Speech, and Language Processing
  • E. A. P. Habets + 3 more

Hands-free devices are often used in a noisy and reverberant environment. Therefore, the received microphone signal does not only contain the desired near-end speech signal but also interferences such as room reverberation that is caused by the near-end source, background noise and a far-end echo signal that results from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of near-end speech. In the last two decades, postfilters have been developed that can be used in conjunction with a single microphone acoustic echo canceller to enhance the near-end speech. In previous works, spectral enhancement techniques have been used to suppress residual echo and background noise for single microphone acoustic echo cancellers. However, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper, we derive a novel spectral variance estimator for the late reverberation of the near-end speech. Residual echo will be present at the output of the acoustic echo canceller when the acoustic echo path cannot be completely modeled by the adaptive filter. A spectral variance estimator for the so-called late residual echo that results from the deficient length of the adaptive filter is derived. Both estimators are based on a statistical reverberation model. The model parameters depend on the reverberation time of the room, which can be obtained using the estimated acoustic echo path. A novel postfilter is developed which suppresses late reverberation of the near-end speech, residual echo and background noise, and maintains a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo, and background noise.
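The late-reverberation estimator above is said to be based on a statistical reverberation model parameterized by the room's reverberation time. A common form of such a model (Polack-style exponential decay; the paper's exact estimator may differ) treats the late-reverb spectral variance as a delayed, exponentially attenuated copy of the reverberant-signal variance:

```python
import numpy as np

def late_reverb_variance(rev_psd_frames, t60, hop_s, late_start_s=0.05):
    """Polack-style statistical reverberation model: the late-reverberation
    spectral variance at frame l is a delayed, exponentially attenuated
    copy of the reverberant-signal variance, with decay set by T60.
    rev_psd_frames: array (frames, freq_bins); hop_s: frame hop in seconds."""
    delta = 3.0 * np.log(10.0) / t60                  # decay rate from T60
    n_late = max(1, int(round(late_start_s / hop_s))) # frames until "late" part
    decay = np.exp(-2.0 * delta * late_start_s)
    est = np.zeros_like(rev_psd_frames)
    est[n_late:] = decay * rev_psd_frames[:-n_late]
    return est
```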

  • Conference Article
  • 10.1109/wcica.2016.7578267
Robust speech recognition based on speech enhancement and improved perceptual non-uniform spectral compression
  • Jun 1, 2016
  • Yi Zhang + 3 more

The recognition rate of speech recognition systems declines in noisy environments. In the signal space, a speech enhancement algorithm that combines the a priori signal-to-noise ratio (SNR) with the auditory masking effect can effectively remove noise from the speech signal. In the feature space, an improved perceptual non-uniform spectral compression feature extraction algorithm can effectively compress speech signals in noisy environments and make the training and recognition environments better matched. Combining the two greatly improves the recognition rate. Experiments on an intelligent medical bed platform show that the algorithm effectively enhances the robustness of speech recognition and maintains the recognition rate in noisy environments.
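The abstract does not specify its a priori SNR estimator; the decision-directed form is the usual starting point for such signal-space enhancement, sketched here for reference (parameter names ours):

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_psd, prev_clean_power, beta=0.98):
    """Decision-directed a priori SNR estimate: a weighted sum of the SNR
    implied by the previous frame's clean-speech estimate and the current
    a posteriori SNR minus one, floored at zero."""
    noise = np.maximum(noise_psd, 1e-12)
    post_snr = noisy_power / noise
    return beta * (prev_clean_power / noise) \
        + (1.0 - beta) * np.maximum(post_snr - 1.0, 0.0)
```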

  • Research Article
  • Cited by 45
  • 10.1186/1741-7007-5-52
Left hemispheric dominance during auditory processing in a noisy environment
  • Nov 15, 2007
  • BMC Biology
  • Hidehiko Okamoto + 4 more

Background: In daily life, we are exposed to different sound inputs simultaneously. During neural encoding in the auditory pathway, the neural activities elicited by these different sounds interact with each other. In the present study, we investigated neural interactions elicited by a masker and an amplitude-modulated test stimulus in primary and non-primary human auditory cortex during ipsilateral and contralateral masking by means of magnetoencephalography (MEG).
Results: We observed significant decrements of auditory evoked responses and a significant inter-hemispheric difference for the N1m response during both ipsilateral and contralateral masking.
Conclusion: The decrements of auditory evoked neural activities during simultaneous masking can be explained by neural interactions evoked by the masker and test stimulus in the peripheral and central auditory systems. The inter-hemispheric differences in N1m decrements during ipsilateral and contralateral masking reflect a basic hemispheric specialization contributing to the processing of complex auditory stimuli, such as speech signals, in noisy environments.

  • Conference Article
  • 10.1117/12.205423
Pitch detection of speech signals in noisy environment by wavelet
  • Apr 6, 1995
  • Wing-Kei Yip + 2 more

The pitch of voiced speech sounds provides very important information in speech analysis. Pitch estimation is a difficult task when unexpected noise is present, and experimental results have shown that even robust pitch detection techniques fail in noisy environments with periodic patterns, such as noise generated by machines. The wavelet transform, with its special time-frequency properties, can be used to detect pitch with remarkable noise resistance. In wavelet signal analysis, the modulus of the transform has been used extensively; however, we found that the phase information is equally important, especially for pitch detection. Since the phase spectrum is largely insensitive to noise, a more reliable pitch period can be obtained from the phase diagram. Properties of the phase pattern of the wavelet transform are investigated, and the result is applied to construct a robust pitch detector. In a first test, the detector is used to detect the pitch of a set of speech signals with white noise; our approach clearly outperforms non-wavelet methods at low signal-to-noise ratios. Sinusoidal noise at different frequency levels is used in a second test, and simulation results show that the system remains stable in such an environment.

  • Research Article
  • Cited by 10
  • 10.1121/1.4949540
Physiologically motivated transmission-lines as front end for loudness models.
  • May 1, 2016
  • The Journal of the Acoustical Society of America
  • Iko Pieper + 3 more

The perception of loudness is strongly influenced by peripheral auditory processing, which calls for a physiologically correct peripheral processing stage when constructing advanced loudness models. Most loudness models, however, follow a functional approach: a parallel auditory filter bank combined with a compression stage, followed by spectral and temporal integration. Such classical loudness models do not allow one to directly link physiological measurements, such as otoacoustic emissions, to properties of their auditory filter bank. This can, however, be achieved with physiologically motivated transmission-line models (TLMs) of the cochlea. Here, two active and nonlinear TLMs were tested as the peripheral front end of a loudness model. The TLMs are followed by a simple generic back end that integrates basilar-membrane "excitation" across place and time to yield a loudness estimate. The proposed approach reaches performance similar to other state-of-the-art loudness models in predicting loudness in sones, equal-loudness contours (including spectral fine structure), and loudness as a function of bandwidth. The suggested model provides a powerful tool for directly connecting objective measures of basilar-membrane compression, such as distortion-product otoacoustic emissions, to loudness in future studies.
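As a toy version of the generic back end described above (compression of excitation, then integration across place and time), with an illustrative exponent and scale rather than the paper's fitted values:

```python
import numpy as np

def loudness_from_excitation(excitation, compress=0.3, scale=0.08):
    """Toy loudness back end: compress basilar-membrane excitation per
    place (power-law specific loudness), integrate across place, then
    average over time. excitation: array of shape (frames, places)."""
    specific = scale * np.power(excitation, compress)  # specific loudness
    instantaneous = specific.sum(axis=1)               # across-place integration
    return instantaneous.mean()                        # temporal integration
```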

  • Conference Article
  • Cited by 1
  • 10.1109/spin.2014.6777056
Empirical mode decomposition based reconstruction of speech signal in noisy environment
  • Feb 1, 2014
  • Nisha Goswami + 2 more

A novel technique for reconstructing speech signals in noisy conditions using Empirical Mode Decomposition (EMD) is described in this paper. EMD is applied to find the glottal source signal of the speech. After obtaining the source information, the vocal tract filter response is determined, and the original speech signal is reconstructed using EMD both with and without prior knowledge of the vocal tract filter responses. The experimental results establish the effectiveness of the proposed method.
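A minimal sketch of the decomposition step, assuming the open-source PyEMD package (`pip install EMD-signal`); the rule for picking which IMFs carry the glottal source is not given in the abstract, so the subset below is purely illustrative:

```python
import numpy as np
from PyEMD import EMD   # from the EMD-signal package

fs = 8000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 120 * t) + 0.3 * np.random.randn(fs)  # toy stand-in

imfs = EMD().emd(signal)           # IMFs ordered from high to low frequency
reconstructed = imfs.sum(axis=0)   # EMD is complete: summing all components
                                   # (the residue is the last row) recovers
                                   # the original signal
source_proxy = imfs[-3:].sum(axis=0)  # low-frequency IMFs as an illustrative
                                      # (not the paper's) glottal-source proxy
```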

  • Research Article
  • Cited by 90
  • 10.1109/tasl.2006.872619
Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures
  • Jan 1, 2007
  • IEEE Transactions on Audio, Speech and Language Processing
  • Bertrand Rivet + 2 more

Looking at the speaker's face can help a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that couples the audiovisual coherence of speech signals, estimated by statistical tools, with audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture in each frequency channel; frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale factor ambiguities encountered in each frequency bin after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
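The permutation-ambiguity step is the part a short sketch can show. The paper resolves permutations with its audiovisual statistical model; the stand-in below uses a much simpler criterion (amplitude-envelope correlation with a running reference), named plainly as a substitute:

```python
import numpy as np

def align_permutations(sep_specs):
    """Resolve the per-frequency source permutation after frequency-domain
    BSS for the two-source case, by correlating amplitude envelopes with a
    smoothed reference (a simple substitute for the paper's audiovisual
    coherence criterion). sep_specs: complex array (freqs, 2, frames)."""
    env = np.abs(sep_specs)
    ref = env[0].copy()                       # reference envelopes
    for f in range(1, sep_specs.shape[0]):
        keep = sum(np.corrcoef(ref[s], env[f, s])[0, 1] for s in (0, 1))
        swap = sum(np.corrcoef(ref[s], env[f, 1 - s])[0, 1] for s in (0, 1))
        if swap > keep:                       # swapped order fits better
            sep_specs[f] = sep_specs[f, ::-1].copy()
            env[f] = env[f, ::-1].copy()
        ref = 0.9 * ref + 0.1 * env[f]        # update running reference
    return sep_specs
```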

  • Research Article
  • Cited by 1
  • 10.5430/air.v5n1p56
An evolutionary approach for segmentation of noisy speech signals for efficient voice activity detection
  • Oct 26, 2015
  • Artificial Intelligence Research
  • Farook Sattar + 2 more

This paper presents a new approach to automatically segmenting speech signals in noisy environments. Segmentation of speech signals is formulated as an optimization problem and the boundaries of the speech segments are detected using a genetic algorithm (GA). The number of segments present in a signal is initially estimated from the reconstructed sequence of the original signal using the minimal number of Walsh basis functions. A multi-population GA is then employed to determine the locations of segment boundaries. The segmentation results are improved through the generations by introducing a new evaluation function which is based on the sample entropy and a heterogeneity measure. Experimental results show that the proposed approach can accurately detect the noisy speech segments as well as noise-only segments under various noisy conditions.

  • Research Article
  • Cited by 3
  • 10.13052/jmm1550-4646.1849
Data Analytics on Eco-Conditional Factors Affecting Speech Recognition Rate of Modern Interaction Systems
  • Mar 16, 2022
  • Journal of Mobile Multimedia
  • A C Kaladevi + 5 more

Speech-based interaction systems belong to the growing class of contemporary human-computer interaction techniques, which have emerged quickly in the last few years. Versatility, multi-channel synchronization, sensitivity, and timing are all notable characteristics of speech recognition, and several variables influence the precision of voice interaction recognition. However, few researchers have studied in depth the four eco-condition variables that tend to affect the speech recognition rate (SRR): ambient noise, human noise, utterance speed, and frequency. The principal goal of this research is to analyze the influence of these four variables on SRR through several stages of experimentation on mixed-noise speech data, using a sparse-representation-based analysis technique. Speech recognition is not noticeably affected by a person's usual speaking pace, and high-frequency voice signals are more easily recognized (~98.12%) than low-frequency speech signals in noisy environments. The test results may help in designing distributed control and command systems.

  • Research Article
  • Cited by 6
  • 10.1109/jbhi.2018.2836180
Development of Novel Hearing Aids by Using Image Recognition Technology.
  • May 15, 2018
  • IEEE journal of biomedical and health informatics
  • Bor-Shing Lin + 6 more

Speech is easily degraded by background noise in real environments, reducing speech intelligibility, particularly for hearing-impaired listeners. To address this, several hearing aids have been developed to enhance speech signals in noisy environments. Most current hearing aids are designed to enhance the speech component and suppress the noise component; however, it is difficult to separate other speech sources. Adaptive signal enhancement with beamforming might improve on this, but effectively determining the location of the desired speaker remains a difficult challenge for adaptive beamforming. A novel hearing aid concept is proposed in this study. Unlike beamforming-based hearing aids, which use the cross-correlation-coefficient method to estimate the time difference of arrival (TDOA), image recognition technology is used to estimate the location of the desired speaker and obtain a more precise TDOA, and adaptive signal enhancement is used to enhance the noisy speech. Experimental results show that the proposed system achieves an absolute TDOA error of less than 1.25 × 10⁻⁴ ms and delivers clear speech from the target speaker the user wants to listen to.
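For context, the cross-correlation TDOA baseline that the image-based localization replaces can be sketched in a few lines (sign convention noted in the comment):

```python
import numpy as np

def tdoa_crosscorr(mic1, mic2, fs):
    """Classic cross-correlation TDOA estimate: the lag that maximizes the
    cross-correlation between the two microphone signals. A positive value
    means mic1 is a delayed copy of mic2 (the sound reached mic2 first)."""
    corr = np.correlate(mic1, mic2, mode='full')
    lag = np.argmax(corr) - (len(mic2) - 1)
    return lag / fs   # delay in seconds
```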

  • Research Article
  • 10.6109/jkiice.2012.16.1.059
Estimation of Speech Coding Methods Using Structural Analysis of Bitstreams
  • Jan 31, 2012
  • The Journal of the Korean Institute of Information and Communication Engineering
  • Hoon Yoo + 3 more

This paper addresses a blind algorithm for estimating and classifying speech compression methods by analyzing the structure of compressed bitstreams. Various speech compression methods, including vocoders, have been developed to transmit or store speech signals at very low bitrates; as a key feature, vocoders inevitably contain a block structure. To classify each compression method, we use the Measure of Inter-Block Correlation (MIBC) to check whether the bitstream includes a block structure and to estimate the block length. Moreover, for compression methods with the same block length, the proposed algorithm identifies the correct method by exploiting the fact that each method has different correlation characteristics at each bit location. Experimental results indicate that the proposed algorithm classifies speech compression methods robustly for various types and lengths of speech signals in noisy environments.
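The paper's exact MIBC definition is not reproduced above, so the following is only a plausible stand-in: score each candidate block length by how strongly bit positions repeat across blocks, and pick the length with the highest score.

```python
import numpy as np

def inter_block_correlation(bits, block_len):
    """Stand-in block-structure score: reshape the bitstream into rows of
    block_len bits; positions that are nearly always 0 or 1 across blocks
    suggest fixed format bits, i.e. a block structure of this length.
    Returns about 0 for random bits, approaching 1 for rigid structure."""
    n_blocks = len(bits) // block_len
    mat = np.asarray(bits[:n_blocks * block_len]).reshape(n_blocks, block_len)
    p_one = mat.mean(axis=0)                 # fraction of ones per position
    return float(np.mean(np.abs(p_one - 0.5)) * 2.0)

def estimate_block_length(bits, candidates=range(40, 400)):
    """Pick the candidate block length with the strongest structure score."""
    return max(candidates, key=lambda b: inter_block_correlation(bits, b))
```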
