Abstract

Several facets of the man–machine interface, such as speech synthesis and recognition in the spoken language realm, can be modeled using neural networks. Here, neural networks are applied to model the lexical prosodic parameters of Finnish: segmental duration, loudness, and pitch. The resulting prosodic models can be used in existing applications such as speech synthesis to improve their naturalness. The text input stream was first converted into a phoneme sequence, from which the input representation for the nets was generated. Inputs included the phoneme's position in the word, the number of phonemes in the word, and context in the form of preceding and following phonemes. Optimal input representations for each type of prosodic net were sought by varying the size of the input vector. The number of hidden nodes was also varied to gauge the complexity of the problem. Duration estimation required class-specific nets for the error to drop below 20%, the difference limen. The loudness error was 2.2 phon (1 phon is just noticeable), while the pitch networks performed well, with an error of 3.5% (equal to 0.6 semitones at 100 Hz, which is below the 1.5-semitone perceptual intonation threshold).
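The pitch figure can be checked with a short calculation. The sketch below assumes the standard equal-tempered definition of a semitone (12 semitones per octave, so an interval is 12 times the base-2 logarithm of the frequency ratio); the function name is hypothetical and not taken from the paper.

```python
import math


def relative_error_to_semitones(relative_error: float) -> float:
    """Convert a relative pitch error (0.035 for 3.5%) to semitones.

    Assumes the equal-tempered scale: an interval in semitones is
    12 * log2(f_predicted / f_reference).
    """
    return 12.0 * math.log2(1.0 + relative_error)


# A 3.5% error at a 100 Hz reference is a 3.5 Hz deviation.
reference_hz = 100.0
predicted_hz = reference_hz * 1.035
semitones = 12.0 * math.log2(predicted_hz / reference_hz)

print(f"{semitones:.2f} semitones")
```

Because the conversion depends only on the frequency ratio, a 3.5% error maps to the same interval at any reference pitch; the 100 Hz reference only fixes the deviation in hertz.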

