Short-time Magnitude Spectrum Research Articles

Incorporating information from the short-time phase spectrum into a feature set for automatic speech recognition (ASR) may improve recognition accuracy. Currently, however, it is common practice to discard this information in favour of features derived purely from the short-time magnitude spectrum. There are two reasons for this: (1) the results of some well-known human listening experiments have indicated that the short-time phase spectrum contributes negligibly to intelligibility at the small window durations of 20–40 ms used for ASR spectral analysis, and (2) using the short-time phase spectrum directly for ASR has proven difficult from a signal processing viewpoint, due to phase wrapping and other problems. In this article, we explore the possibility of using short-time phase spectrum information for ASR by considering these two points. To address the first point, we review the results of our own set of human listening experiments. Contrary to previous studies, our results indicate that the short-time phase spectrum can contribute significantly to speech intelligibility over small window durations of 20–40 ms. These listening experiments, together with some ASR experiments, also indicate that at least part of this intelligibility may be supplementary to that provided by the short-time magnitude spectrum. To address the second point (i.e., the signal processing difficulties), we suggest that it may be necessary to transform the short-time phase spectrum into a more physically meaningful representation from which useful features could be extracted. Specifically, we investigate the frequency-derivative (or group delay function, GDF) and the time-derivative (or instantaneous frequency distribution, IFD) as candidates for this intermediate representation. We review our recent work, in which we performed various experiments showing that the GDF and IFD may be useful for ASR. In that work, we also conducted several ASR experiments to test a feature set derived from the GDF alone. We found that, in most cases, these features perform worse than the standard MFCC features. Therefore, we suggest that a short-time phase spectrum feature set may ultimately need to combine information from both the GDF and IFD representations. For best performance, the feature set may also need to be concatenated with short-time magnitude spectrum information. Beyond addressing the two points above, we also discuss a number of other speech applications in which the short-time phase spectrum has proven very useful. We believe that an appreciation of how the short-time phase spectrum has been used for other tasks, together with the results of our own experiments, will encourage fellow researchers to investigate its potential for use in ASR.
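
As a rough illustration of the two derivative representations named in this abstract, the sketch below computes a group delay function and an instantaneous frequency estimate for a single analysis frame. It is not taken from the article: the function names (`dft`, `group_delay`, `instantaneous_frequency`), the rectangular window, and the one-sample hop are all assumptions chosen to keep the example short. The group delay uses the standard identity that avoids explicit phase unwrapping, and the instantaneous frequency is approximated by a wrapped phase difference between two neighbouring frames.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of one frame; adequate for a short illustration."""
    n_fft = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_fft) for n, x in enumerate(frame))
        for k in range(n_fft)
    ]


def group_delay(frame, eps=1e-12):
    """Negative frequency-derivative of the phase, in samples, for each bin.

    Uses tau(k) = Re{Y(k) X*(k)} / |X(k)|^2 with Y = DFT{n * x[n]},
    so the phase never has to be unwrapped.
    """
    spectrum = dft(frame)
    ramped = dft([n * x for n, x in enumerate(frame)])
    return [
        (y * x.conjugate()).real / max(abs(x) ** 2, eps)
        for x, y in zip(spectrum, ramped)
    ]


def instantaneous_frequency(signal, start, n_fft, hop=1):
    """Time-derivative of the phase, in radians per sample, for each bin.

    Approximated by the wrapped phase difference between two frames that
    are `hop` samples apart (hypothetical parameter; the estimate is only
    unambiguous while the true phase advance per hop stays within pi).
    """
    first = dft(signal[start:start + n_fft])
    second = dft(signal[start + hop:start + hop + n_fft])
    # phase(b * conj(a)) equals phase(b) - phase(a) wrapped to (-pi, pi]
    return [cmath.phase(b * a.conjugate()) / hop for a, b in zip(first, second)]
```

For a unit impulse delayed by `d` samples, `group_delay` returns `d` at every bin, and for a cosine centred on a DFT bin, `instantaneous_frequency` recovers the cosine's angular frequency at that bin. Bins with negligible energy give unreliable values in both functions.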

State-of-the-art automatic speech recognition (ASR) systems use only the short-time magnitude spectrum for feature extraction; the short-time phase spectrum is generally ignored. Results from our recent human listening tests indicate that the short-time phase spectrum can contribute significantly to speech intelligibility over small window durations (i.e., 20–40 ms). This result suggests that the short-time phase spectrum may be useful for ASR, which commonly employs such window durations for spectral analysis. In this paper, we continue our investigation of the short-time phase spectrum. We explore the use of partial short-time phase spectrum information, in the absence of all short-time magnitude spectrum information, for intelligible signal reconstruction. We create two types of stimuli: one in which the frequency-derivative of the phase spectrum (i.e., the group delay function, GDF) is preserved, and another in which its time-derivative (i.e., the instantaneous frequency distribution, IFD) is preserved. We do this to determine the contribution that each derivative makes toward intelligibility. Reconstructing stimuli from knowledge of only the GDF or only the IFD results in poor intelligibility. However, when we create stimuli using knowledge of both the GDF and the IFD, reasonable intelligibility is obtained. In light of these results, we conclude that both the GDF and IFD components of the short-time phase spectrum are needed to reconstruct an intelligible signal. We also perform experiments to quantify the intelligibility of stimuli reconstructed from the short-time phase and magnitude spectra of noisy speech. The intelligibility of stimuli constructed from either the short-time magnitude spectrum or the short-time phase spectrum degrades at a similar rate as the noise level increases. The intelligibility of the original signals also degrades with increasing noise, but in all cases it is superior to that of the stimuli constructed from the separate short-time components. Therefore, we argue that knowledge of both short-time magnitude and phase spectrum information results in superior human speech recognition performance.
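
To make the idea of stimuli built from only one short-time component more concrete, here is a minimal single-frame sketch. It is an illustration under stated assumptions, not the authors' procedure: phase-only frames are formed by setting every bin magnitude to one, magnitude-only frames by setting every bin phase to zero, and windowing and overlap-add across frames are omitted. The function names are hypothetical.

```python
import cmath
import math


def dft(frame, inverse=False):
    """Naive forward or inverse DFT of one frame."""
    n_fft = len(frame)
    sign = 1 if inverse else -1
    out = [
        sum(x * cmath.exp(sign * 2j * math.pi * k * n / n_fft) for n, x in enumerate(frame))
        for k in range(n_fft)
    ]
    return [v / n_fft for v in out] if inverse else out


def phase_only_frame(frame):
    """Keep each bin's phase and replace its magnitude with 1 (assumed choice)."""
    modified = [cmath.rect(1.0, cmath.phase(x)) for x in dft(frame)]
    return [v.real for v in dft(modified, inverse=True)]


def magnitude_only_frame(frame):
    """Keep each bin's magnitude and replace its phase with 0 (assumed choice)."""
    modified = [complex(abs(x), 0.0) for x in dft(frame)]
    return [v.real for v in dft(modified, inverse=True)]
```

A delayed unit impulse shows what each component retains: `phase_only_frame` returns the impulse at its original position, whereas `magnitude_only_frame` moves it to sample zero, because the timing information resides in the phase.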

Articles published on Short-time Magnitude Spectrum

Adaptive Weiner filtering with AR-GWO based optimized fuzzy wavelet neural network for enhanced speech enhancement.

On training targets for deep learning approaches to clean speech magnitude spectrum estimation.

Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing

Multiple F0 Estimation and Source Clustering of Polyphonic Music Audio Using PLCA and HMRFs

Improving Speech Recognition Rate through Analysis Parameters

Effect of Analysis Window Duration on Speech Intelligibility

Exploiting Conjugate Symmetry of the Short-Time Fourier Spectrum for Speech Enhancement

Real-Time Signal Estimation From Modified Short-Time Fourier Transform Magnitude Spectra

Significance of the Modified Group Delay Feature in Speech Recognition

Short-time phase spectrum in speech processing: A review and some experimental results

Further intelligibility results from human listening tests using the short-time phase spectrum

On the usefulness of STFT phase spectrum in human listening tests

Noise suppression by spectral magnitude estimation —mechanism and theoretical limits—
