Abstract

State-of-the-art automatic speech recognition (ASR) systems use only the short-time magnitude spectrum for feature extraction; the short-time phase spectrum is generally ignored. Results from our recent human listening tests indicate that the short-time phase spectrum can contribute significantly to speech intelligibility at small window durations (i.e., 20–40 ms). This result suggests that the short-time phase spectrum may be useful for ASR, which commonly employs window durations of 20–40 ms for spectral analysis. In this paper, we continue our investigation of the short-time phase spectrum. We explore the use of partial short-time phase spectrum information, in the absence of all short-time magnitude spectrum information, for intelligible signal reconstruction. We create two types of stimuli: one in which the frequency derivative of the short-time phase spectrum (i.e., the group delay function, GDF) is preserved, and another in which its time derivative (i.e., the instantaneous frequency distribution, IFD) is preserved. We do this to determine the contribution that each derivative makes toward intelligibility. Reconstructing stimuli from knowledge of only the GDF or only the IFD results in poor intelligibility. However, when we create stimuli using knowledge of both the GDF and the IFD, reasonable intelligibility is obtained. In light of these results, we conclude that both the GDF and IFD components of the short-time phase spectrum are needed to reconstruct an intelligible signal. We also perform experiments to quantify the intelligibility of stimuli reconstructed from the short-time phase and magnitude spectra of noisy speech. The intelligibility of stimuli constructed from either the short-time magnitude spectrum or the short-time phase spectrum degrades at a similar rate as the noise level increases. The intelligibility of the original signals also degrades with increasing noise, but in all cases it is superior to that of the stimuli constructed from the separate short-time components. Therefore, we argue that knowledge of both short-time magnitude and phase spectrum information results in superior human speech recognition performance.
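
To make the quantities named above concrete, the following NumPy sketch shows one way a phase-only stimulus and finite-difference approximations of the GDF and IFD could be computed. It is an illustration only and not the procedure used in the paper: the Hann window, the 32 ms frame, the quarter-frame hop, the unit-magnitude substitution, and all function names are assumptions chosen for the example.

```python
import numpy as np

def stft(x, frame_len, hop, window):
    """Short-time Fourier transform: rows are frames, columns are frequency bins."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len, hop, window, length):
    """Weighted overlap-add resynthesis, normalized by the summed squared window."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        y[start:start + frame_len] += frame * window
        norm[start:start + frame_len] += window ** 2
    return y / np.maximum(norm, 1e-12)

def phase_only_stimulus(x, fs, frame_ms=32.0):
    """Discard the short-time magnitude (set it to 1) and keep only the phase.

    Assumed parameters: Hann window, hop of one quarter frame.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 4
    window = np.hanning(frame_len)
    phase = np.angle(stft(x, frame_len, hop, window))
    unit_magnitude_spec = np.exp(1j * phase)
    return istft(unit_magnitude_spec, frame_len, hop, window, len(x))

def phase_derivatives(phase):
    """Finite-difference approximations of the two phase derivatives.

    gdf: negative difference of unwrapped phase along frequency (radians per bin).
    ifd: difference of unwrapped phase along time (radians per hop).
    """
    gdf = -np.diff(np.unwrap(phase, axis=1), axis=1)
    ifd = np.diff(np.unwrap(phase, axis=0), axis=0)
    return gdf, ifd
```

For a signal `x` sampled at `fs` Hz, `phase_only_stimulus(x, fs)` returns a waveform of the same length whose short-time magnitude information has been replaced by a constant. `phase_derivatives` expects a frames-by-bins phase array such as `np.angle(stft(...))`.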
