Abstract
There are two approaches to constructing an appropriate fundamental frequency (F0) control method for speech synthesis: statistical and rule-based. The statistical approach has the advantage of automatic training, but it requires a large speech corpus annotated with prosodic boundaries. Recently, a method was proposed for high-accuracy detection of these boundaries [Ostendorf and Ross (1996)], given a set of prosodic boundary candidates that includes almost all of the correct boundaries. This paper proposes a detection method to generate these boundary candidates, specifically for accentual phrases, which represent one of the smallest prosodic units. The detection algorithm uses local maxima and minima of the F0 contour and low-energy regions of the speech waveform to find candidate regions that correspond to accentual phrases and pauses in speech. The candidate phrase boundaries are then aligned to the nearest phoneme boundaries, which are detected automatically by forced alignment with a speaker-independent speech recognition system, given a phoneme transcription. The method was applied to 250 read Japanese sentences. A high detection accuracy (97%) was obtained, and almost all of the missed detections had valid candidates within ±3 phonemes. The number of insertion errors was less than double the number of correct boundaries.
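The candidate-generation steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the F0 extrema and low-energy regions are combined, so the sketch assumes that F0 local minima and the edges of low-energy regions serve as raw candidates. The function names, the frame-based representation, and the `energy_threshold` and `min_pause_frames` parameters are all hypothetical, and the phoneme boundaries are assumed to come from an external forced aligner as a sorted list of frame indices.

```python
from bisect import bisect_left


def local_minima(f0):
    """Indices of strict local minima of an F0 contour (one value per frame).

    Assumption: the contour is already smoothed and interpolated across
    unvoiced frames, so every frame holds a usable F0 value.
    """
    return [i for i in range(1, len(f0) - 1)
            if f0[i] < f0[i - 1] and f0[i] < f0[i + 1]]


def low_energy_regions(energy, threshold, min_frames):
    """(start, end) frame ranges, end exclusive, where energy stays below
    `threshold` for at least `min_frames` consecutive frames."""
    regions, start = [], None
    for i, e in enumerate(energy):
        if e < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                regions.append((start, i))
            start = None
    if start is not None and len(energy) - start >= min_frames:
        regions.append((start, len(energy)))
    return regions


def snap_to_phoneme_boundaries(candidates, phoneme_boundaries):
    """Move each candidate frame to the nearest phoneme boundary.

    `phoneme_boundaries` must be sorted and non-empty. Duplicates that
    arise after snapping are merged.
    """
    snapped = set()
    for c in candidates:
        j = bisect_left(phoneme_boundaries, c)
        neighbours = phoneme_boundaries[max(j - 1, 0):j + 1]
        snapped.add(min(neighbours, key=lambda b: abs(b - c)))
    return sorted(snapped)


def candidate_boundaries(f0, energy, phoneme_boundaries,
                         energy_threshold, min_pause_frames):
    """Hypothetical pipeline: F0 minima plus pause edges, snapped to phonemes."""
    raw = list(local_minima(f0))
    for start, end in low_energy_regions(energy, energy_threshold,
                                         min_pause_frames):
        raw.extend([start, end])
    return snap_to_phoneme_boundaries(raw, phoneme_boundaries)
```

Because every raw candidate is kept and only merged after snapping, a sketch like this favours recall over precision, which matches the stated goal of producing a candidate set that a later, more accurate detector can prune.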