Speech enhancement based on perceptual loudness and statistical models of speech
This dissertation is concerned with speech enhancement based on statistical and loudness models. We study the field of speech enhancement with the objective of improving the quality of speech signals in noisy environments. First, speech enhancement based on the Laplacian model for speech signals is reviewed; its performance is shown to be limited by the accuracy of Laplacian parameter estimation in noisy conditions. A recursive version is proposed that estimates the Laplacian model parameters from the enhanced speech and then uses these estimates to re-enhance the original noisy speech. This approach achieves better parameter estimation and hence further improvement in speech quality. Next, loudness models for speech are reviewed. Since loudness describes the human hearing system better than the spectrum, the fundamental approaches of spectral subtraction are extended to the loudness domain, and we propose a loudness subtraction approach. Tests are run for subtraction with different values of the exponent α in the loudness model. Simulations show that the quality of enhanced speech can be optimized by choosing the appropriate α for a given input SNR; an adaptive-α subtraction model is therefore proposed, and simulations show that it further improves on spectral subtraction. The proposed loudness subtraction with fixed α is then shown to provide better overall results than classical spectral subtraction, even though noise residue and unpleasant…
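The abstract does not give the exact loudness mapping, so the following is only a minimal sketch of what subtraction in a power-law (loudness) domain could look like, assuming a Stevens-style compression |X|^(2α) of the power spectrum; the function name, spectral floor, and framing parameters are illustrative, not the dissertation's.

```python
import numpy as np

def loudness_subtraction(noisy, noise_psd, alpha=0.3, frame=256, hop=128):
    """Sketch of power-law (loudness-domain) spectral subtraction.

    noise_psd: estimated noise power spectrum, length frame // 2 + 1.
    Magnitudes are compressed as |X|**(2*alpha) before subtracting the
    noise estimate, then expanded back; alpha = 1 recovers ordinary
    power spectral subtraction.
    """
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        mag, phase = np.abs(spec), np.angle(spec)
        # subtract in the compressed ("loudness") domain
        loud = mag ** (2 * alpha) - noise_psd ** alpha
        loud = np.maximum(loud, 0.01 * mag ** (2 * alpha))  # spectral floor
        enhanced = loud ** (1 / (2 * alpha)) * np.exp(1j * phase)
        out[start:start + frame] += np.fft.irfft(enhanced) * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

Smaller α performs the subtraction in a more compressed, more loudness-like domain; this α is the knob the dissertation's adaptive-α model tunes per input SNR.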
1466
- 10.1109/89.928915
- Jul 1, 2001
- IEEE Transactions on Speech and Audio Processing
342
- 10.1109/49.138987
- Jun 1, 1992
- IEEE Journal on Selected Areas in Communications
52
- 10.1049/ip-vis:20000323
- Jan 1, 2000
- IEE Proceedings - Vision, Image, and Signal Processing
5
- 10.1109/icassp.1991.150503
- Jan 1, 1991
321
- 10.1109/tsa.2005.851927
- Sep 1, 2005
- IEEE Transactions on Speech and Audio Processing
291
- 10.1109/tassp.1977.1162974
- Aug 1, 1977
- IEEE Transactions on Acoustics, Speech, and Signal Processing
226
- 10.1109/89.482211
- Jan 1, 1995
- IEEE Transactions on Speech and Audio Processing
14
- 10.1109/iecon.1995.483843
- Jan 1, 1995
190
- 10.1109/89.701361
- Jul 1, 1998
- IEEE Transactions on Speech and Audio Processing
- Conference Article
9
- 10.1109/isie.2017.8001429
- Jun 1, 2017
This paper presents a low-latency smartphone app that achieves real-time noise reduction of speech signals in noisy sound environments. The app overcomes the two shortcomings, high latency and the musical noise artifact, associated with previously developed apps for the same purpose. Results of both objective and subjective evaluations are reported, showing the effectiveness of the app in reducing noise for speech signals in noisy environments, in particular in babble background noise.
- Book Chapter
1
- 10.1007/978-981-10-3223-3_63
- Jun 1, 2017
Speech enhancement is a process that improves the quality of a speech signal in a noisy environment: it removes or reduces the background noise to recover an improved version of the original speech signal. Degradation of the speech signal is the most common problem in speech communication, so speech enhancement plays a vital role in improving signal quality. A number of methods are used for speech enhancement; here we use an LPC-based FEM in a Wiener filter. This method is compared against several speech enhancement algorithms on the NOIZEUS speech database and yields better speech quality. The experimental results show that the proposed method provides better results with no information loss in the original speech signal.
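The chapter's exact "LPC-based FEM" pipeline is not spelled out in the abstract, so this is only a sketch of the standard building blocks it implies: an all-pole (LPC) model of the noisy spectrum feeding a Wiener gain H = Pss/(Pss + Pnn). Function names and the order/FFT-size conventions are assumptions, and the PSD scales are assumed consistent.

```python
import numpy as np

def levinson_lpc(frame, order=12):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def wiener_gain(noisy_frame, noise_psd, order=12):
    """Wiener gain H = Pss / (Pss + Pnn), with the noisy-spectrum envelope
    modelled by an all-pole (LPC) fit and the speech PSD obtained by
    subtracting the noise PSD from it."""
    nfft = len(noise_psd) * 2 - 2                  # matches rfft bin count
    a, err = levinson_lpc(noisy_frame, order)
    lpc_psd = err / np.abs(np.fft.rfft(a, nfft)) ** 2
    speech_psd = np.maximum(lpc_psd - noise_psd, 0.0)
    return speech_psd / (speech_psd + noise_psd + 1e-12)
```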
- Conference Article
1
- 10.21437/interspeech.2009-420
- Sep 6, 2009
This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. The method extends a hybrid noise reduction scheme, based on the combination of a Generalized Sidelobe Canceller (GSC) and a single-channel noise reduction stage, to dereverberation. We employ the Transfer Function GSC (TF-GSC), which is better suited to dereverberation, and use an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator as the single-channel stage, modified for the dereverberation application. Experimental results demonstrate the superiority of the proposed method in dereverberating speech signals in noisy environments. Index Terms: dereverberation, spectral estimator, TF-GSC, AMMSE
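As a rough illustration of the GSC half of the hybrid, here is a plain time-domain GSC (not the paper's TF-GSC): a fixed delay-and-sum beamformer, an adjacent-difference blocking matrix, and an NLMS adaptive noise canceller. All parameters are illustrative, and the target is assumed already time-aligned across microphones.

```python
import numpy as np

def gsc_enhance(mics, taps=32, mu=0.1):
    """Minimal time-domain GSC sketch. mics: (channels, samples)."""
    m, n = mics.shape
    fbf = mics.mean(axis=0)                # fixed beamformer output
    blocked = mics[:-1] - mics[1:]         # blocking matrix: target cancelled
    w = np.zeros((m - 1, taps))
    out = np.zeros(n)
    for t in range(taps, n):
        u = blocked[:, t - taps:t]         # reference noise snapshots
        y = np.sum(w * u)                  # adaptive noise estimate
        out[t] = fbf[t] - y
        w += mu * out[t] * u / (np.sum(u * u) + 1e-8)   # NLMS update
    return out
```

The paper replaces the fixed steering with estimated acoustic transfer functions (TF-GSC) and passes the beamformer output to the AMMSE spectral stage.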
- Research Article
4
- 10.1049/iet-spr.2010.0012
- Apr 1, 2011
- IET Signal Processing
The authors present a novel method for reducing the late reverberation of speech signals in noisy environments. The amplitude of the clean signal is obtained by an adaptive estimator that minimises the mean square error (MSE) under signal presence uncertainty. The spectral gain function, an adaptive variable-order minimum-MSE estimator, is obtained as a weighted geometric mean of hypothetical gains associated with speech presence and absence. The order of the estimator is estimated individually for each time frame and each frequency component; adapting the order to the probability of speech presence makes the estimation more accurate. The evaluations confirm the superiority of the proposed method in dereverberating speech signals in noisy environments.
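The "weighted geometric mean of hypothetical gains" has a compact closed form in OM-LSA-type estimators, which this abstract echoes; a sketch under that assumption follows, with q and g_min as illustrative parameters rather than the paper's values.

```python
import numpy as np

def presence_probability(xi, gamma, q=0.5):
    """Per-bin speech presence probability from the a priori SNR xi and
    a posteriori SNR gamma under a Gaussian model (Cohen/Ephraim form)."""
    v = xi * gamma / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * np.exp(-v))

def soft_gain(g_speech, p, g_min=0.1):
    """Weighted geometric mean of the speech-presence gain and a small
    speech-absence floor: G = g_speech**p * g_min**(1 - p)."""
    return g_speech ** p * g_min ** (1.0 - p)
```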
- Research Article
63
- 10.1109/tasl.2008.2002071
- Nov 1, 2008
- IEEE Transactions on Audio, Speech, and Language Processing
Hands-free devices are often used in a noisy and reverberant environment. The received microphone signal therefore contains not only the desired near-end speech signal but also interferences such as room reverberation caused by the near-end source, background noise, and a far-end echo signal resulting from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of near-end speech. In the last two decades, postfilters have been developed that can be used in conjunction with a single-microphone acoustic echo canceller to enhance the near-end speech. In previous work, spectral enhancement techniques have been used to suppress residual echo and background noise for single-microphone acoustic echo cancellers; however, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper, we derive a novel spectral variance estimator for the late reverberation of the near-end speech. Residual echo will be present at the output of the acoustic echo canceller when the acoustic echo path cannot be completely modeled by the adaptive filter; a spectral variance estimator for the so-called late residual echo, which results from the deficient length of the adaptive filter, is also derived. Both estimators are based on a statistical reverberation model whose parameters depend on the reverberation time of the room, which can be obtained from the estimated acoustic echo path. A novel postfilter is developed which suppresses late reverberation of the near-end speech, residual echo, and background noise, while maintaining a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo, and background noise.
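The statistical reverberation model behind such estimators is typically Polack's exponential-decay model; here is a sketch of a late-reverberation PSD estimate in that spirit. The frame delay and hop values are illustrative, not the paper's.

```python
import numpy as np

def late_reverb_psd(reverb_psd, t60, hop_s, delay_frames=8):
    """Polack-style late-reverberation PSD estimate: the reverberant-signal
    PSD delayed by delay_frames and attenuated by exp(-2*delta*T_d),
    where delta = 3*ln(10)/T60. reverb_psd: (frames, bins)."""
    delta = 3.0 * np.log(10.0) / t60
    decay = np.exp(-2.0 * delta * delay_frames * hop_s)
    late = np.zeros_like(reverb_psd)
    late[delay_frames:] = decay * reverb_psd[:-delay_frames]
    return late
```

With T60 recovered from the estimated echo path, as the abstract describes, the same decay constant serves both the late-reverberation and late-residual-echo estimators.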
- Conference Article
- 10.1109/wcica.2016.7578267
- Jun 1, 2016
Owing to the decline in the recognition rate of speech recognition systems in noisy environments, two complementary algorithms are combined. In signal space, a speech enhancement algorithm that combines the a priori signal-to-noise ratio (SNR) with the auditory masking effect effectively removes noise from the speech signal. In feature space, an improved non-uniform spectral perceptual compression feature extraction algorithm effectively compresses speech signals in noisy environments and makes the training and recognition environments better matched. Finally, the recognition rate is greatly improved by combining the two. Experiments on an intelligent medical bed platform show that the algorithm effectively enhances the robustness of speech recognition and maintains the recognition rate in noisy environments.
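The a priori SNR in such schemes is commonly obtained with the Ephraim-Malah decision-directed rule; a minimal sketch follows (the paper's auditory-masking stage is omitted, and the smoothing factor beta is an assumed typical value).

```python
import numpy as np

def decision_directed_snr(mag2, noise_psd, prev_speech_mag2, beta=0.98):
    """Decision-directed a priori SNR estimate: a weighted mix of the
    previous frame's enhanced speech power and the instantaneous
    max(gamma - 1, 0) estimate."""
    gamma = mag2 / (noise_psd + 1e-12)                 # a posteriori SNR
    xi = beta * prev_speech_mag2 / (noise_psd + 1e-12) \
         + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0)
    return xi, gamma
```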
- Research Article
45
- 10.1186/1741-7007-5-52
- Nov 15, 2007
- BMC Biology
Background: In daily life, we are exposed to different sound inputs simultaneously. During neural encoding in the auditory pathway, the neural activities elicited by these different sounds interact. In the present study, we investigated neural interactions elicited by a masker and an amplitude-modulated test stimulus in primary and non-primary human auditory cortex during ipsilateral and contralateral masking by means of magnetoencephalography (MEG). Results: We observed significant decrements of auditory evoked responses and a significant inter-hemispheric difference for the N1m response during both ipsi- and contralateral masking. Conclusion: The decrements of auditory evoked neural activities during simultaneous masking can be explained by neural interactions evoked by the masker and test stimulus in peripheral and central auditory systems. The inter-hemispheric differences of N1m decrements during ipsi- and contralateral masking reflect a basic hemispheric specialization contributing to the processing of complex auditory stimuli such as speech signals in noisy environments.
- Conference Article
- 10.1117/12.205423
- Apr 6, 1995
The pitch of voiced speech provides very important information in speech analysis, but pitch estimation is difficult when unwanted noise is present. Experimental results have shown that even robust pitch detection techniques fail in noisy environments with periodic patterns, such as noise generated by machines. The wavelet transform, with its special time-frequency properties, can be used to detect pitch with remarkable noise resistance. In wavelet signal analysis the modulus of the transform has been used extensively; however, we found that the phase information is equally important, especially for pitch detection. Since the phase spectrum is largely insensitive to noise, a more reliable pitch period can be obtained from the phase diagram. Properties of the phase pattern of the wavelet transform are investigated, and the result is applied to construct a robust pitch detector. In a first test, the detector is used to detect the pitch of a set of speech signals corrupted by white noise; our approach clearly outperforms non-wavelet methods at low signal-to-noise ratios. Sinusoidal noise at different frequency levels is used in a second test, and simulation results show that the system remains quite stable in such an environment.
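To illustrate why wavelet phase is useful here: a complex wavelet centred near the expected F0 has a phase that advances by 2π per fundamental period, so the pitch can be read off the spacing of phase wraps. A sketch, with f_center and cycles as assumed parameters rather than the paper's:

```python
import numpy as np

def phase_pitch(x, fs, f_center=150.0, cycles=6.0):
    """Phase-based pitch sketch: bandpass with a complex Morlet atom near
    the expected F0, then measure the spacing of 2*pi phase wraps."""
    dur = cycles / f_center
    t = np.arange(-dur, dur, 1.0 / fs)
    sigma = cycles / (2.0 * np.pi * f_center)
    wavelet = np.exp(2j * np.pi * f_center * t) * np.exp(-t**2 / (2 * sigma**2))
    phase = np.angle(np.convolve(x, wavelet, mode='same'))
    # phase wraps (+pi -> -pi) mark successive fundamental periods
    wraps = np.where(np.diff(phase) < -np.pi)[0]
    if len(wraps) < 2:
        return 0.0
    period = np.median(np.diff(wraps)) / fs
    return 1.0 / period
```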
- Research Article
10
- 10.1121/1.4949540
- May 1, 2016
- The Journal of the Acoustical Society of America
The perception of loudness is strongly influenced by peripheral auditory processing, which calls for a physiologically correct peripheral processing stage when constructing advanced loudness models. Most loudness models, however, follow a functional approach: a parallel auditory filter bank combined with a compression stage, followed by spectral and temporal integration. Such classical loudness models do not make it possible to directly link physiological measurements like otoacoustic emissions to properties of their auditory filterbank; this can, however, be achieved with physiologically motivated transmission-line models (TLMs) of the cochlea. Here, two active and nonlinear TLMs were tested as the peripheral front end of a loudness model. The TLMs are followed by a simple generic back end which integrates basilar-membrane "excitation" across place and time to yield a loudness estimate. The proposed model reaches performance similar to that of other state-of-the-art loudness models in predicting loudness in sones, equal-loudness contours (including spectral fine structure), and loudness as a function of bandwidth. The suggested model provides a powerful tool for directly connecting objective measures of basilar-membrane compression, such as distortion product otoacoustic emissions, to loudness in future studies.
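The "simple generic back end" amounts to integrating excitation across place and time; a toy sketch follows, where the exponent and scale k are illustrative calibration constants, not the paper's fitted values.

```python
import numpy as np

def loudness_backend(excitation, dt, exponent=1.0, k=1.0):
    """Integrate basilar-membrane 'excitation' (channels x time) across
    place, then time, to a single loudness estimate."""
    specific = np.abs(excitation) ** exponent      # per-place contribution
    return k * specific.sum(axis=0).sum() * dt     # integrate place, then time
```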
- Conference Article
1
- 10.1109/spin.2014.6777056
- Feb 1, 2014
A novel technique for speech signal reconstruction using Empirical Mode Decomposition (EMD) of speech signals in noisy conditions is described in this paper. EMD is applied to find the glottal source signal of speech. After obtaining the source information, the vocal tract filter response is determined and the original speech signal is reconstructed with the help of EMD, with and without prior knowledge of the vocal tract filter responses. The experimental results establish the effectiveness of the proposed method.
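A sketch of the EMD step, using the PyEMD package; the split of IMFs into "source" and "vocal-tract" parts below is an assumed heuristic for illustration, not the paper's criterion.

```python
import numpy as np
from PyEMD import EMD   # pip install EMD-signal

def emd_source_estimate(x, keep_from=3):
    """Decompose a speech frame into IMFs and treat the sum of the
    lower-frequency IMFs as a rough glottal-source estimate."""
    imfs = EMD().emd(np.asarray(x, dtype=float))   # IMFs ordered high -> low frequency
    keep_from = min(keep_from, len(imfs) - 1)      # guard short decompositions
    source = imfs[keep_from:].sum(axis=0)          # low-frequency part ~ source
    vocal_tract_detail = x - source                # remainder ~ vocal-tract detail
    return source, vocal_tract_detail
```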
- Research Article
90
- 10.1109/tasl.2006.872619
- Jan 1, 2007
- IEEE Transactions on Audio, Speech and Language Processing
Looking at the speaker's face helps a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel; frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale factor ambiguities encountered at each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
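The permutation-fixing idea can be sketched by scoring each per-frequency permutation against a visual feature; the paper's statistical audiovisual model is replaced here by plain correlation, and all shapes and names are assumptions for illustration.

```python
import numpy as np
from itertools import permutations

def align_permutations(sources, video_feature):
    """Resolve the per-frequency permutation ambiguity of frequency-domain
    BSS: for each bin, pick the permutation whose first output's envelope
    correlates best with a visual speech feature (e.g., lip opening).
    sources: (freq, src, time); video_feature: (time,)."""
    f_bins, n_src, _ = sources.shape
    aligned = np.empty_like(sources)
    for f in range(f_bins):
        best, best_score = None, -np.inf
        for perm in permutations(range(n_src)):
            env = np.abs(sources[f, perm[0]])
            score = np.corrcoef(env, video_feature)[0, 1]
            if score > best_score:
                best, best_score = perm, score
        aligned[f] = sources[f, list(best)]
    return aligned
```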
- Research Article
1
- 10.5430/air.v5n1p56
- Oct 26, 2015
- Artificial Intelligence Research
This paper presents a new approach to automatically segmenting speech signals in noisy environments. Segmentation of speech signals is formulated as an optimization problem and the boundaries of the speech segments are detected using a genetic algorithm (GA). The number of segments present in a signal is initially estimated from the reconstructed sequence of the original signal using the minimal number of Walsh basis functions. A multi-population GA is then employed to determine the locations of segment boundaries. The segmentation results are improved through the generations by introducing a new evaluation function which is based on the sample entropy and a heterogeneity measure. Experimental results show that the proposed approach can accurately detect the noisy speech segments as well as noise-only segments under various noisy conditions.
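Below is a sketch of a sample-entropy-based evaluation of a candidate segmentation; it only loosely mirrors the paper's evaluation function (the heterogeneity term here is a simple variance difference between neighbouring segments, and all thresholds are illustrative).

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Plain sample entropy: -log(A/B) with template length m and
    tolerance r = r_factor * std(x). Intended for short segments;
    memory grows quadratically with segment length."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    def count(mm):
        templ = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        d = np.max(np.abs(templ[:, None] - templ[None, :]), axis=2)
        return (np.sum(d <= r) - len(templ)) / 2.0   # exclude self-matches
    b, a = count(m), count(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

def segmentation_fitness(signal, boundaries):
    """GA fitness sketch: reward heterogeneity between neighbouring
    segments, penalize within-segment sample entropy."""
    edges = [0, *sorted(boundaries), len(signal)]
    segs = [signal[edges[i]:edges[i + 1]] for i in range(len(edges) - 1)]
    entropy = sum(sample_entropy(s) for s in segs if len(s) > 10)
    hetero = sum(abs(np.var(segs[i]) - np.var(segs[i + 1]))
                 for i in range(len(segs) - 1))
    return hetero - entropy
```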
- Research Article
3
- 10.13052/jmm1550-4646.1849
- Mar 16, 2022
- Journal of Mobile Multimedia
Speech-based interaction systems belong to the growing class of contemporary human-computer interaction techniques that have emerged rapidly in the last few years. Versatility, multi-channel synchronization, sensitivity, and timing are all notable characteristics of speech recognition, and several variables influence the precision of voice interaction recognition. However, few researchers have studied in depth the eco-condition variables that affect the speech recognition rate (SRR): ambient noise, human noise, utterance speed, and frequency. The principal goal of this research is to analyze the influence of these four variables on SRR through several stages of experimentation on mixed-noise speech data. A sparse-representation-based analysis technique is used to study the effects. Speech recognition is not noticeably affected by a person's usual speaking pace, and high-frequency voice signals are recognized more easily (~98.12%) than low-frequency speech signals in noisy environments. The test results may help in designing distributed command-and-control systems.
- Research Article
6
- 10.1109/jbhi.2018.2836180
- May 15, 2018
- IEEE journal of biomedical and health informatics
Speech is easily corrupted by background noise in real environments, which reduces speech intelligibility, particularly for hearing-impaired listeners. To address this, several hearing aids have been developed to enhance speech signals in noisy environments. Most current hearing aids are designed to enhance the speech component and suppress the noise component; however, it is difficult to separate other speech sources. Adaptive signal enhancement with beamforming can help, but effectively determining the location of the desired speaker remains a difficult challenge for adaptive beamforming. A novel hearing aid concept is proposed in this study. Unlike beamforming-based hearing aids, which use the cross-correlation-coefficient method to estimate the time difference of arrival (TDOA), image recognition is used to estimate the location of the desired speaker and obtain a more precise TDOA. Adaptive signal enhancement is then used to enhance the noisy speech. Experimental results show that the proposed system achieves an absolute TDOA error of less than 1.25 × 10⁻⁴ ms and delivers clear speech from the target speaker the user wants to listen to.
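The cross-correlation baseline the paper improves upon is typically implemented as GCC-PHAT; a standard sketch (function name and defaults are ours, not the paper's):

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=None):
    """GCC-PHAT time-difference-of-arrival estimate between two channels."""
    n = len(sig_a) + len(sig_b)
    spec = np.fft.rfft(sig_a, n) * np.conj(np.fft.rfft(sig_b, n))
    spec /= np.abs(spec) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(spec, n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds
```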
- Research Article
- 10.6109/jkiice.2012.16.1.059
- Jan 31, 2012
- The Journal of the Korean Institute of Information and Communication Engineering
This paper addresses a blind estimation and classification algorithm for speech compression methods based on analysis of the structure of the compressed bitstreams. Various speech compression methods, including vocoders, have been developed to transmit or store speech signals at very low bitrates; as a key feature, vocoders inevitably contain a block structure. To classify each compression method, we use the Measure of Inter-Block Correlation (MIBC) to check whether the bitstream includes a block structure and to estimate the block length. Moreover, for compression methods with the same block length, the proposed algorithm identifies the correct compression method by exploiting the fact that each method has different correlation characteristics at each bit location. Experimental results indicate that the proposed algorithm classifies the speech compression methods robustly for various types and lengths of speech signals in noisy environments.
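MIBC's exact definition is not given in the abstract; as a stand-in, a block-structure score can be sketched by measuring how biased each bit position becomes once the stream is folded at a candidate block length. Everything below is an assumed illustration, not the paper's measure.

```python
import numpy as np

def block_length_score(bits, candidate):
    """Fold the 0/1 bitstream into rows of length `candidate` and score
    the per-column bias of the bit mean; a true block structure makes
    some bit positions strongly biased, raising the score."""
    n = (len(bits) // candidate) * candidate
    blocks = np.asarray(bits[:n]).reshape(-1, candidate)
    col_mean = blocks.mean(axis=0)
    return np.mean((col_mean - 0.5) ** 2)

def estimate_block_length(bits, max_len=512):
    """Pick the candidate block length with the highest score."""
    scores = {L: block_length_score(bits, L) for L in range(8, max_len + 1)}
    return max(scores, key=scores.get)
```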