Abstract

This paper proposes two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system. The first feature is the punctuation confidence (PC), which measures the likelihood that a major punctuation mark (MPM) can be inserted at a word boundary. The second feature is the quotation confidence (QC), which measures the likelihood that a word string is quoted as a meaningful or emphasized unit. The proposed PC and QC features are influenced by the properties of automatic Chinese punctuation generation and the linguistic characteristics of the Chinese punctuation system. Because MPMs are highly correlated with prosodic–acoustic features and quoted word strings serve crucial roles in human language understanding, the two features could potentially provide useful information for prosody generation. This idea was realized by employing conditional random field (CRF)-based models to predict MPMs, quoted word string locations, and their associated confidences (that is, PC and QC) for each word boundary. The predicted punctuation marks and their confidences were then combined with traditional linguistic features as input to multilayer perceptrons that predict the prosodic–acoustic features used for speech synthesis. Both objective and subjective tests demonstrated that the prosody generated with the proposed linguistic features was superior to that generated without them. Therefore, the proposed PC and QC are identified as promising features for Mandarin prosody generation.
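
To make the idea of a per-boundary confidence concrete, the following is a minimal sketch of how a PC-like value could be read off a linear-chain CRF as a posterior marginal, computed with the forward-backward algorithm. It is an illustration under stated assumptions, not the authors' implementation: the two-label set (0 = no MPM, 1 = MPM), the function names, and the log-potential inputs `emit` and `trans` are all hypothetical, and the abstract does not state that PC is defined exactly as this marginal.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def boundary_marginals(emit, trans):
    """Posterior marginals P(y_t = y | x) of a linear-chain CRF.

    emit[t][y]  : log-potential of label y at word boundary t (assumed input)
    trans[p][y] : log-potential of moving from label p to label y (assumed input)
    """
    T, K = len(emit), len(emit[0])
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[0.0] * K for _ in range(T)]  # beta at the last position is log(1) = 0

    # Forward pass: alpha[t][y] sums over all label prefixes ending in y at t.
    alpha[0] = list(emit[0])
    for t in range(1, T):
        for y in range(K):
            alpha[t][y] = emit[t][y] + logsumexp(
                [alpha[t - 1][p] + trans[p][y] for p in range(K)]
            )

    # Backward pass: beta[t][y] sums over all label suffixes following y at t.
    for t in range(T - 2, -1, -1):
        for y in range(K):
            beta[t][y] = logsumexp(
                [trans[y][n] + emit[t + 1][n] + beta[t + 1][n] for n in range(K)]
            )

    log_z = logsumexp(alpha[T - 1])  # log partition function
    return [
        [math.exp(alpha[t][y] + beta[t][y] - log_z) for y in range(K)]
        for t in range(T)
    ]

def punctuation_confidence(emit, trans, mpm_label=1):
    """Assumed PC: marginal probability of the MPM label at each boundary."""
    return [row[mpm_label] for row in boundary_marginals(emit, trans)]

# Toy potentials for three word boundaries (illustrative numbers only).
emit = [[1.0, -1.0], [0.0, 0.5], [-0.5, 2.0]]
trans = [[0.2, -0.2], [0.0, -1.0]]
pc = punctuation_confidence(emit, trans)
```

Unlike a hard punctuation decision, such a marginal stays informative at boundaries where no mark is predicted, which is what makes it usable as a continuous input feature alongside the traditional linguistic features.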

Highlights

  • Prosody generation serves a crucial role in a text-to-speech (TTS) system.

  • In hidden Markov model (HMM)-based synthesis, the most popular speech synthesis method [7,8,9,10], prosodic–acoustic features are modeled at the HMM state level, that is, by the state duration, the state log-F0 value, and the energy contour enclosed by the spectral parameters.

  • The CRF-based major punctuation mark (MPM) generator and the CRF-based quotation generator can potentially be trained robustly on a large text corpus to provide useful prosodic information that is highly correlated with major punctuation marks and quoted phrases.


Summary

Introduction

Prosody generation serves a crucial role in a text-to-speech (TTS) system. It can be regarded as a function mapping from linguistic features to prosodic structures or prosodic–acoustic features. An automatic punctuation generation model trained on a large text corpus can learn the punctuation strategies of the corpus's various contributors, and the MPMs it predicts provide useful cues for predicting both prosodic breaks [28, 31] and prosodic–acoustic features [29, 30, 31]. The CRF-based quotation generation model predicts the structure of a quoted word string (hereafter referred to as the quoted phrase, or QP) from the bracket-removed word or POS sequences and calculates the associated confidence, which is referred to as the QC. The PC and QC are determined from features of word or POS sequences, which can be obtained robustly with current word segmentation and POS-tagging technologies, without complicated statistical syntactic parsing. This advantage makes the proposed approach suitable for practical online unlimited TTS. This section analyzes the quoted phrases in the ASBC text corpus, identifying possible QC candidates for the training of the CRF-based quotation model.
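
As a rough illustration of what "bracket-removed word sequences" could look like as training data for a quotation model, the sketch below strips quotation brackets from a tokenized sentence and keeps the quoted-phrase structure as per-word tags. Everything specific here is an assumption made for illustration: the bracket characters 「 and 」, the function name, and the simple B/M/E/S/O tag set (begin, middle, end, single-word phrase, outside). The paper's own prediction targets may differ, and nested quotations are not handled.

```python
OPEN, CLOSE = "「", "」"  # assumed quotation brackets, given as separate tokens

def strip_quotes_and_label(tokens):
    """Remove quotation brackets and tag each remaining word.

    Returns (words, tags), where tags use a hypothetical scheme:
    B = first word of a quoted phrase, M = middle, E = last,
    S = single-word quoted phrase, O = outside any quoted phrase.
    """
    words, tags = [], []
    inside = False
    span_start = None  # index in `words` where the current quoted phrase begins
    for tok in tokens:
        if tok == OPEN:
            inside = True
            span_start = len(words)
        elif tok == CLOSE:
            # Relabel the last word of a non-empty quoted phrase.
            if inside and span_start is not None and len(words) > span_start:
                tags[-1] = "E" if len(words) - span_start > 1 else "S"
            inside = False
            span_start = None
        else:
            words.append(tok)
            if not inside:
                tags.append("O")
            elif len(words) - 1 == span_start:
                tags.append("B")
            else:
                tags.append("M")
    return words, tags

# Toy segmented sentence (illustrative only).
tokens = ["他", "說", "「", "明天", "會", "下雨", "」", "。"]
words, tags = strip_quotes_and_label(tokens)
# words: ["他", "說", "明天", "會", "下雨", "。"]
# tags:  ["O", "O", "B", "M", "E", "O"]
```

A sequence model trained on such pairs sees only the bracket-free words (or their POS tags) at prediction time, so it can assign a quotation confidence to word strings in text that contains no brackets at all.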

Section 3: Construction of the CRF-based MPM generation model
Section 5: Prosody generation experiments
Section 6: Conclusions and future work
Proposed PC
Design of prediction targets
B B2 B3 M M E
Advanced feature set—PCs and QCs
Findings
Conclusions and future work
