The Mel-frequency cepstral coefficient (MFCC) parameterization for automatic speech recognition (ASR) exploits several perceptual features of the human auditory system, one of which is static compression. Motivated by the human auditory system, we analyze the conventional static logarithmic compression applied in the MFCC using psychophysical loudness perception curves. Following the property of the auditory system that dynamic range compression is higher in the basal regions than in the apical regions of the basilar membrane, we propose a method of unequal (asymmetric) compression, i.e., higher compression applied in the higher frequency regions than in the lower frequency regions. The method is applied and tested in the MFCC and PLP parameterizations in the spectral domain, and in the ZCPA auditory model used as an ASR front-end in the temporal domain. The extent of the asymmetric compression is applied as a multiplicative gain to the existing static compression, and is determined from the gradient of the piece-wise linear segment of the perceptual compression curve. The proposed method has the advantage of adjusting compression parametrically for improved ASR performance and audibility in noise conditions through low-frequency spectral enhancement, particularly of vowels with lower F1 and F2 formants. Continuous-density HMM recognition experiments on the Aurora 2 and TIdigits corpora show performance improvements in additive noise conditions.
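As a rough illustration of the idea of a frequency-dependent multiplicative gain on top of static logarithmic compression, the following minimal sketch may help. It is not the authors' implementation: the function name `asymmetric_log_compress`, the linear gain profile, and the values `gain_low` and `gain_high` are all hypothetical, whereas the abstract derives the gain from the gradient of a perceptual compression curve. The sketch also assumes that "higher compression" corresponds to a smaller gain on the log energies, which is equivalent to a smaller power-law exponent in that band.

```python
import numpy as np

def asymmetric_log_compress(band_energies, gain_low=1.0, gain_high=0.7, floor=1e-10):
    """Log-compress filterbank energies with a per-band multiplicative gain.

    band_energies: array of shape (n_frames, n_bands), lowest band first.
    gain_low, gain_high: hypothetical gains for the lowest and highest band.
    A gain below 1 shrinks the log-domain dynamic range, i.e. compresses more.
    """
    energies = np.asarray(band_energies, dtype=float)
    n_bands = energies.shape[-1]

    # Assumed gain profile: linear interpolation across bands. The abstract
    # does not specify the profile; this is only for illustration.
    gains = np.linspace(gain_low, gain_high, n_bands)

    # Conventional static log compression, floored to avoid log(0).
    log_energies = np.log(np.maximum(energies, floor))

    # Asymmetric compression: per-band gain applied to the log energies.
    return gains * log_energies

# Example: two frames, three bands, every energy equal to e^2.
frames = np.full((2, 3), np.exp(2.0))
compressed = asymmetric_log_compress(frames)
```

With these assumed gains, the log energies in the lowest band are left unchanged while those in the highest band are scaled down, so the low-frequency region is emphasized relative to the high-frequency region.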