A spectro-temporal analysis of speech based on nonlinear operators

Alain Migneault, Jean Rouat, Sylvain Lemieux

doi:10.21437/icslp.1992-405

Abstract

This paper proposes a spectro-temporal analysis based on a bank of cochlear filters in combination with a nonlinear operator for amplitude modulation enhancement in the medium and high frequency formants. The output of the spectro-temporal analysis is represented as a 3D image in which very short-term speech transitions and formant modulations can be observed. Such an analysis yields patterns characteristic of phonemes and of transitions between phonemes, which cannot be obtained with other speech analysis techniques (FFT, LPC). The paper presents 3D images of vowels in which the amplitude modulation of the formants is clearly visible.

1. INTRODUCTION

Speech analysis is recognized as an important field within speech processing, with applications in speech coding, speech recognition, etc. Depending on the application, the speech analyzer has to extract the most appropriate parameters. This paper proposes an analysis that enhances the modulation properties of speech. The automatic demodulation of speech with nonlinear operators based on perceptual knowledge is a problem which has not yet been fully addressed, and it might assist the researcher in understanding speech and/or in designing an efficient speech analysis.

2. MODULATED TONE PERCEPTION

Since the auditory system does not resolve the high frequency components, the temporal features of vowel-like sounds are coded similarly to those of amplitude-modulated tones. Furthermore, research on the automatic demodulation of speech can be motivated by the hypothesis that the human brain has neural cells which specialize in Amplitude Modulation (AM) and Frequency Modulation (FM) detection [3][18][19].
More recently, Schreiner and Langner [6][17] have studied the representation of amplitude modulation in the inferior colliculus of the cat and have shown that it contains a highly systematic topographic representation of amplitude modulation parameters.

3. BASILAR MEMBRANE NONLINEARITIES

Nonlinearity and the perception of intermodulation distortion products ($f_1 - f_2$, $2f_1 - f_2$, etc.) are a pressing issue in hearing research, and the exact origin of these nonlinearities is not easy to understand. Recently, Robles, Ruggero and Rich have observed distortion products on the chinchilla basilar membrane by using a laser-velocimetry technique [16]. Their work suggests that the living basilar membrane is a nonlinear system and thus that the perception of distortion products could be due to the basilar membrane response and not only to neural postprocessing.

4. NONLINEAR SPEECH PROCESSING

Nonlinear processing

The proposed analysis attempts to account for the automatic demodulation of the signal before it is transformed into neural pulses in the cochlea. In fact, we will show that nonlinear operations on the signal create distortion products and can enhance the modulation properties of the signal. Two nonlinear operators will be included in the proposed analysis to enhance the amplitude modulation observed with vowels. Nonlinear filtering is very attractive and much work has been done in that field; refer to [9] for examples. More recently, one can cite the work by P. Maragos et al. [7], where it is shown that a nonlinear operator called the Teager energy operator [5] allows AM and FM demodulation. Furthermore, L. Atlas and J. Fang [1] have shown that quadratic detectors allow for a better representation of speech in the context of a noisy pitch tracker. The originality of the present work resides in the combination of a perceptual bank of filters with nonlinear operators to obtain a 3D representation of speech with the AM information enhanced.
Nonlinear operators

J.F. Kaiser [5] proposes the Teager energy operator as being able to extract the energy of a signal based on mechanical and physical considerations. It has been shown [7] that this operator is able to track either the amplitude of an AM signal or the frequency of an FM signal. Another nonlinear operator has been proposed [14] to take into consideration the changes in the instantaneous signal power in the cochlea. This operator, called Dyn, shows the ability to enhance the AM-FM modulation in speech. Generally speaking, nonlinear operators are simple tools that can modify the signal spectrum by combining its spectral information. This ability is particularly interesting for AM or FM demodulation and for spectrum shifting, which are not easy to perform with standard linear techniques.

Figure 1 illustrates the output of the Teager energy and Dyn operators for two tones. The first section of Figure 1 is a 600 Hz tone and the second section is a 1000 Hz tone. Section 3 is the sum of the 600 Hz and 1000 Hz tones. Sections 4 and 5 are respectively the outputs of the Dyn and Teager energy operators for the signal from section 3.

Let us consider the combination tone defined as $s(t) = A_1\cos(\omega_1 t) + A_2\cos(\omega_2 t)$. By using the analog version of the Teager energy operator [4], one can show that:

$$\mathrm{Teager}[s(t)] = (A_1\omega_1)^2 + (A_2\omega_2)^2 + A_1A_2\left(\frac{\omega_1^2+\omega_2^2}{2} - \omega_1\omega_2\right)\cos[(\omega_1+\omega_2)t] + A_1A_2\left(\frac{\omega_1^2+\omega_2^2}{2} + \omega_1\omega_2\right)\cos[(\omega_1-\omega_2)t] \tag{1}$$

The difference in amplitude between the two components of the Teager output is equal to $2A_1A_2\omega_1\omega_2$. Therefore, the $\omega_1 - \omega_2$ component largely dominates the $\omega_1 + \omega_2$ component, as observed in section 5 of Figure 1, where $\omega_1 - \omega_2 = 2\pi(1000 - 600)$ rad/s.
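Equation (1) can be checked numerically. The sketch below assumes the usual continuous-time form of the Teager energy operator, $s'(t)^2 - s(t)\,s''(t)$, and evaluates it on the two-tone signal using analytic derivatives. The unit amplitudes, the 10 ms window and the function names `teager_two_tone` and `closed_form` are illustrative choices, not taken from the paper.

```python
import numpy as np

def teager_two_tone(t, a1, w1, a2, w2):
    """Analog Teager energy s'(t)^2 - s(t) s''(t) of s = a1 cos(w1 t) + a2 cos(w2 t)."""
    s = a1 * np.cos(w1 * t) + a2 * np.cos(w2 * t)
    ds = -a1 * w1 * np.sin(w1 * t) - a2 * w2 * np.sin(w2 * t)          # first derivative
    d2s = -a1 * w1**2 * np.cos(w1 * t) - a2 * w2**2 * np.cos(w2 * t)   # second derivative
    return ds**2 - s * d2s

def closed_form(t, a1, w1, a2, w2):
    """Right-hand side of equation (1), plus the two component amplitudes."""
    mean_sq = (w1**2 + w2**2) / 2
    sum_amp = a1 * a2 * (mean_sq - w1 * w2)    # amplitude of the w1 + w2 component
    diff_amp = a1 * a2 * (mean_sq + w1 * w2)   # amplitude of the w1 - w2 component
    value = ((a1 * w1)**2 + (a2 * w2)**2
             + sum_amp * np.cos((w1 + w2) * t)
             + diff_amp * np.cos((w1 - w2) * t))
    return value, sum_amp, diff_amp

# Assumed test conditions: unit amplitudes, 1000 Hz and 600 Hz tones, 10 ms window.
a1 = a2 = 1.0
w1, w2 = 2 * np.pi * 1000.0, 2 * np.pi * 600.0
t = np.linspace(0.0, 0.01, 2001)

direct = teager_two_tone(t, a1, w1, a2, w2)
formula, sum_amp, diff_amp = closed_form(t, a1, w1, a2, w2)

max_rel_err = np.max(np.abs(direct - formula)) / np.max(np.abs(formula))
ratio = diff_amp / sum_amp

print("max relative error:", max_rel_err)
print("difference/sum amplitude ratio:", ratio)
```

With equal amplitudes, the ratio of the two component amplitudes reduces to $((\omega_1+\omega_2)/(\omega_1-\omega_2))^2$, which is 16 for the 1000 Hz and 600 Hz tones. This is consistent with the dominance of the difference component noted above.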
Similarly, by using the analog form of the Dyn operator [13], one can show that:

$$\mathrm{Dyn}[s(t)] = -\frac{A_1^2\omega_1}{2}\sin(2\omega_1 t) - \frac{A_2^2\omega_2}{2}\sin(2\omega_2 t) - A_1A_2\left(\frac{\omega_1+\omega_2}{2}\right)\sin[(\omega_1+\omega_2)t] - A_1A_2\left(\frac{\omega_1-\omega_2}{2}\right)\sin[(\omega_1-\omega_2)t] \tag{2}$$

By comparing equation (2) with Figure 1, we observe that the component $\omega_1 + \omega_2 = 2\pi(1000 + 600)$ rad/s is predominant in the output of Dyn for the composite signal. In summary, nonlinear operators are simple and powerful tools for obtaining distortion components from a sum of pure tones, and they might be used in speech processing, where some of the distortion components might be perceptually important.

5. THE ANALYSIS OF SPEECH

In this section, we describe how a perceptual filterbank has been used in conjunction with the Teager energy or Dyn operators to generate a 3D representation of speech in which the amplitude modulation of formants has been enhanced.

Filtering

The current version of the analyzer comprises a bank of twenty-four filters with centre frequencies from 330 Hz to 4700 Hz. These filters partially simulate the frequency analysis performed by the cochlea. They are rounded exponential filters with the Equivalent Rectangular Bandwidths (ERB) proposed by Patterson [11] and Moore and Glasberg [10]. The output of each filter is a bandpass signal with a narrow-band spectrum centred around $f_i$, where $f_i$ is the central frequency (C.F.) of channel $i$. According to communication theory [2], the output signal $s_i(t)$ from channel $i$ can be considered to have been modulated in amplitude and phase with a carrier frequency of $f_i$:

$$s_i(t) = A_i(t)\cos[\omega_i t + \phi_i(t)] \tag{3}$$

$A_i(t)$ is the modulating amplitude and $\phi_i(t)$ is the modulating phase. It should be noted that equation (3) holds only for a bandpass signal (bandwidth of $A_i(t)$ and $\phi_i(t)$ small in
