Modeling of pitch, loudness, and segmental durations in Finnish using neural networks

Toomas Altosaar, Martti Vainio, Matti Karjalainen

doi:10.1121/1.416764

Abstract

Several facets of the man–machine interface, such as speech synthesis and recognition in the spoken language realm, can be modeled using neural networks. Here, neural networks were applied to model the lexical prosodic parameters of Finnish: segmental duration, loudness, and pitch. The resulting prosodic models can be used in currently viable applications such as speech synthesis to further improve their naturalness. The text input stream was first converted into a phoneme sequence, from which the input representation for the nets was generated. Inputs included phoneme position in the word, the number of phonemes in the word, and context in terms of preceding and following phonemes. Optimal input representations for each type of prosodic net were searched for by varying the size of the input vector. The number of hidden nodes was also varied to determine the complexity of the problem. For duration, class-specific nets were required for the error to drop below 20%, the difference limen. For loudness, the error was 2.2 phon (1 phon is just noticeable), while the pitch networks performed well with an error of 3.5% (equal to 0.6 semitones at 100 Hz, which is below the 1.5-semitone perceptual intonation threshold).
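
The pitch figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes the 3.5% error is a relative deviation in fundamental frequency and uses the standard equal-tempered definition of a semitone (12 semitones per octave). The function name `relative_error_to_semitones` is a hypothetical helper introduced here.

```python
import math


def relative_error_to_semitones(rel_error: float) -> float:
    """Convert a relative frequency deviation (0.035 for 3.5%) to semitones.

    Assumes the equal-tempered scale: one semitone is a frequency
    ratio of 2 ** (1 / 12), so an interval is 12 * log2(ratio).
    """
    return 12.0 * math.log2(1.0 + rel_error)


# Assumption: the 3.5% pitch error is a relative F0 deviation.
pitch_error_st = relative_error_to_semitones(0.035)

# Perceptual intonation threshold quoted in the abstract, in semitones.
threshold_st = 1.5

print(f"{pitch_error_st:.2f} semitones")  # prints "0.60 semitones"
print(pitch_error_st < threshold_st)      # prints "True"
```

Under this ratio-based definition the semitone value does not depend on the reference frequency; at 100 Hz the same 3.5% corresponds to an absolute deviation of 3.5 Hz.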
