Articulatory phonetics describes speech as a sequence of overlapping articulatory gestures, each of which may be associated with a characteristic ideal target spectrum. In normal speech, the idealized target gestures for each speech sound are seldom fully attained, and the speech signal exhibits only transitions between such (implicit) targets. It has been suggested that the underlying speech sounds can only be recovered by reference to detailed knowledge of the gestures by which individual speech sounds are produced. It will be shown that it is possible to decompose the speech signal into overlapping “temporal transition functions” using techniques that make no assumptions about the phonetic structure of the signal or the articulatory constraints used in speech production. Previous work has shown that these techniques can produce a large reduction in the information rate needed to represent the spectral information in speech signals [B.S. Atal, Proc. ICASSP 83, 2.6, 81–84 (1983)]. It will be shown that these methods are able to derive speech components of low bandwidths that vary on a time scale closely related to traditional phonetic events. Implications for perception, as well as applications of such techniques to speech coding and as a possible front end for speech recognition, will be discussed.
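To make the decomposition concrete, the sketch below shows the low-rank factorization at the core of the cited temporal-decomposition approach: a matrix of frame-by-frame spectral parameters is approximated as a small set of spectral target vectors weighted by slowly varying temporal functions. This is a minimal illustration, not the method of the paper; the function name, the use of a plain truncated SVD, and the choice of parameters are all assumptions, and Atal's additional step of rotating the basis into temporally compact, overlapping event functions is omitted.

```python
import numpy as np

def temporal_decomposition(Y, n_events):
    """
    Minimal sketch of a temporal-decomposition-style factorization.

    Y        : (n_params, n_frames) array of frame-by-frame spectral
               parameters (e.g., log-area ratios).
    n_events : number of temporal functions to extract.

    Returns (A, Phi) with Y approximately equal to A @ Phi, where each
    row of Phi is a slowly varying (low-bandwidth) temporal function and
    each column of A is the corresponding spectral target vector.
    """
    # Truncated SVD gives the best rank-n_events approximation of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :n_events] * s[:n_events]   # spectral target vectors
    Phi = Vt[:n_events, :]               # temporal functions
    # NOTE: Atal's method further rotates these basis functions so each
    # one is temporally compact and overlaps only its neighbors; that
    # refinement is not implemented in this sketch.
    return A, Phi

# Hypothetical usage: 10 spectral parameters over 200 frames, 8 events.
Y = np.random.randn(10, 200)             # stand-in for real speech data
A, Phi = temporal_decomposition(Y, n_events=8)
print(np.linalg.norm(Y - A @ Phi) / np.linalg.norm(Y))  # relative error
```

The reduction in information rate comes from transmitting only the target vectors and the low-bandwidth temporal functions, which can be sampled far more coarsely than the original frame rate.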