Abstract

A Concept-to-Speech (CTS) generator is a system that integrates language generation with speech synthesis and produces speech from semantic representations. This contrasts with Text-to-Speech (TTS) systems, where speech is produced from text. CTS systems have an advantage over TTS systems because semantic and pragmatic information is available to them; such information is considered crucial for prosody generation, a process which models the variations in pitch, tempo and rhythm. My goal is to build a CTS system which produces more natural and intelligible speech than TTS. The CTS system is being developed as part of MAGIC (Dalal et al. 1996), a multimedia presentation generation system for the health-care domain. My thesis emphasizes investigating and establishing systematic methodologies for automatic prosody modeling using corpus analysis. Prosody modeling in most previous CTS systems employs handcrafted rules, with little evaluation of the overall performance of the rules. By systematically applying different machine learning techniques to a speech corpus, I am able to automatically model prosody for a given domain. Another focus of my thesis is system architecture. There are two concerns when designing a CTS system: modularity and extensibility. The goal is to design a flexible CTS system so that new prosody generators, natural language generators and speech realization systems can be incorporated without requiring major changes to the existing system. Designing a CTS system to facilitate multimedia synchronization is another focus of this research. I have conducted initial investigations of different prosody models using a speech corpus collected from a medical domain, exploring several machine learning techniques. For example, a classification-based rule induction system and a generalized linear model are used to identify and combine salient prosody indicators. Hidden Markov Models are also used to automatically derive probability models that predict a sequence of prosodic features from a sequence of language features. Preliminary results (Pan and McKeown 1998)
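As an illustration of how a Hidden Markov Model can map a sequence of language features to a sequence of prosodic features, here is a minimal sketch that assumes Viterbi decoding over a toy model. The abstract does not specify the decoding procedure, the state inventory, or the features used, so the labels (`accented`, `unaccented`), the word classes, and every probability below are hypothetical values chosen only for the example. They are not taken from the thesis or its corpus.

```python
import math

# Hypothetical toy model: hidden states are prosodic labels, observations are
# coarse word classes. All probabilities are invented for illustration.
STATES = ("accented", "unaccented")
START = {"accented": 0.4, "unaccented": 0.6}
TRANS = {
    "accented": {"accented": 0.3, "unaccented": 0.7},
    "unaccented": {"accented": 0.6, "unaccented": 0.4},
}
EMIT = {
    "accented": {"noun": 0.6, "verb": 0.3, "function": 0.1},
    "unaccented": {"noun": 0.2, "verb": 0.2, "function": 0.6},
}


def viterbi(observations):
    """Return the most probable prosodic label sequence for the observations."""
    if not observations:
        return []
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Calling `viterbi(["function", "noun"])` on this toy model yields `["unaccented", "accented"]`. In a corpus-based setting, the start, transition and emission tables would be estimated from annotated speech data rather than written by hand.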
