Abstract
AbstractThe full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the posterior probability that a frame of an unknown speech signal is related to a specific state. In this paper we use the raw output of a modulation spectrum analyser in combination with sparse coding as a means for obtaining state posterior probabilities. The modulation spectrum analyser uses 15 gammatone filters. The Hilbert envelope of the output of these filters is then processed by nine modulation frequency filters, with bandwidths up to 16 Hz. Experiments using the AURORA-2 task show that the novel approach is promising. We found that the representation of medium-term dynamics in the modulation spectrum analyser must be improved. We also found that we should move towards sparse classification, by modifying the cost function in sparse coding such that the class(es) represented by the exemplars weigh in, in addition to the accuracy with which unknown observations are reconstructed. This creates two challenges: (1) developing a method for dictionary learning that takes the class occupancy of exemplars into account and (2) developing a method for learning a mapping from exemplar activations to state posterior probabilities that keeps the generalization to unseen conditions that is one of the strongest advantages of sparse coding.
Highlights
Nobody will seriously disagree with the statement that most of the information in acoustic signals is encoded in the way in which the signal properties change over time and that instantaneous characteristics, such as the shape or the envelope of the short-time spectrum, are less important - though surely not unimportant
In experimenting with the AURORA-2 task, it is a pervasive finding that the results depend strongly on the word insertion penalty (WIP) that is used in the Viterbi back end
In this paper we set aside a small development set, on which we searched the WIP that gave the best results in the conditions with signal-to-noise ratio (SNR) ≤ 5 dB; in these conditions the best performance was obtained with the same WIP value
Summary
Nobody will seriously disagree with the statement that most of the information in acoustic signals is encoded in the way in which the signal properties change over time and that instantaneous characteristics, such as the shape or the envelope of the short-time spectrum, are less important - though surely not unimportant. The dynamic changes over time in the envelope of the short-time spectrum are captured in the modulation spectrum [1,2,3]. This makes the modulation spectrum a fundamentally more informative representation of audio signals than a sequence of short-time spectra. In this paper we are concerned with the use of modulation spectra for automatic speech recognition (ASR), noise-robust speech recognition. In this application domain, we cannot rely on the intervention of the human auditory system. It is necessary to automatically extract the information encoded in the modulation spectrum that humans would use to understand the message
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: EURASIP Journal on Audio, Speech, and Music Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.