Abstract
This paper describes the theoretical background and implementation of a speech synthesis engine that uses an index of features describing a natural speech source to provide pointers to waveform segments, which can then be re-sequenced to form novel utterances. By efficiently labelling the features of speech that are minimally sufficient to describe the perceptually relevant variation in acoustic and prosodic characteristics, we reduce the task of synthesis to "retrieval" rather than "replication", and reuse of the original waveform segments becomes possible without the need for (perceptually damaging) signal processing. The drawback of this approach is that it requires a large corpus of natural speech from a single speaker, but recent improvements in data-storage and CPU technology have overcome this problem. The style of the corpus speech determines the style of the synthesis, but experiments with corpora of emotional speech confirm that the speaking style can easily be controlled by switching source corpora. By shifting "knowledge" out of the synthesizer and into the source data, we produce an engine that can work on any adequately labelled speech corpus. The interesting work for the future lies in determining which features are relevant for capturing the variation in speech, and how they can best be described.
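To make the retrieval idea concrete, the following is a minimal sketch, not the paper's implementation: each corpus waveform segment is labelled with a small feature vector, and synthesis selects the closest-matching segment for each target and concatenates the original samples unmodified. All names here (Segment, FeatureIndex, synthesize) are illustrative assumptions, and a real system would also score join costs between adjacent segments.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Segment:
    """A labelled stretch of the source-corpus waveform (hypothetical structure)."""
    features: np.ndarray   # e.g. phone identity, pitch, duration, energy
    samples: np.ndarray    # raw audio samples, stored untouched


class FeatureIndex:
    """Index over corpus segments: synthesis becomes retrieval, not replication."""

    def __init__(self, segments: list[Segment]):
        self.segments = segments
        self.matrix = np.stack([s.features for s in segments])

    def retrieve(self, target: np.ndarray) -> Segment:
        # Nearest neighbour in feature space; the paper's actual selection
        # criteria are richer than a plain Euclidean distance.
        dists = np.linalg.norm(self.matrix - target, axis=1)
        return self.segments[int(np.argmin(dists))]


def synthesize(index: FeatureIndex, targets: list[np.ndarray]) -> np.ndarray:
    """Re-sequence original waveform segments; no signal processing is applied."""
    return np.concatenate([index.retrieve(t).samples for t in targets])
```

Switching the speaking style then amounts to building a FeatureIndex over a different source corpus, with the engine itself left unchanged.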