A study on the consistency analysis of energy parameter for Mandarin speech

Li-Te Shen, Cheng-Yu Yeh, Shaw-Hwa Hwang

doi:10.1186/1687-4722-2012-28

Abstract

In this study, a consistency analysis of the energy parameter for Mandarin speech is presented. Identified by inspecting the human pronunciation process, the consistency can be interpreted as a high correlation of a warping curve between the spectrum and the prosody within a syllable. The consistency analysis proceeds in three steps. First, the hidden Markov model (HMM) algorithm is used to decode the HMM-state sequence within a syllable and, at the same time, to divide it into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde–Buzo–Gray algorithm is used to train the VQ codebook of each segment. Third, the energy vector of each segment is encoded as an index by the VQ codebooks, and the probability of each possible path is then evaluated as a prerequisite for analyzing the consistency. Experiments demonstrate that consistency is obtained when the syllable is located in exactly the same word. These results suggest a research direction: the energy warping process within a syllable must be considered in a text-to-speech system to improve the synthesized speech quality.
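The second and third steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the HMM segmentation has already produced three fixed-length energy vectors per syllable token, and the names `lbg`, `nearest`, and `path_probabilities`, the additive split offset `eps`, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter

def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def lbg(vectors, size, eps=0.01, iters=20):
    """Train a codebook by LBG splitting; `size` is assumed to be a power of two."""
    dim = len(vectors[0])
    # Start from a single codeword: the centroid of all training vectors.
    codebook = [[sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]]
    while len(codebook) < size:
        # Split every codeword into a pair offset by +eps and -eps.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # Assign each vector to its nearest codeword, then recompute centroids.
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(codebook, v)].append(v)
            codebook = [
                [sum(v[d] for v in cell) / len(cell) for d in range(dim)] if cell else cw
                for cw, cell in zip(codebook, cells)
            ]
    return codebook

def path_probabilities(tokens, codebooks):
    """Relative frequency of each index path (one VQ index per segment)."""
    paths = Counter(
        tuple(nearest(cb, seg) for cb, seg in zip(codebooks, token))
        for token in tokens
    )
    total = sum(paths.values())
    return {path: count / total for path, count in paths.items()}

# Toy data (hypothetical): each token holds three 2-dimensional energy vectors,
# one per segment of the same designated syllable.
tokens = ([([0.0, 0.1], [1.0, 1.1], [0.5, 0.4])] * 4
          + [([2.0, 2.1], [3.0, 3.1], [2.5, 2.4])] * 4)
codebooks = [lbg([t[k] for t in tokens], 2) for k in range(3)]
probs = path_probabilities(tokens, codebooks)
```

In this sketch, a path is the triple of codebook indices chosen for the three segments of one token. A probability mass concentrated on a few paths would indicate that the energy contour of the syllable is consistent across its occurrences.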

Highlights

A text-to-speech (TTS) system [1,2,3,4,5,6] converts a text input into a speech output and is applied to smart human-computer interfaces and auxiliary speech systems for the visually impaired.
One approach uses corpus-based synthesis units [10,11,12,13]; the other uses small-footprint synthesis units [4,14,15,16,17,18]. The corpus-based speech synthesis technique relies on a unit selection method and the compilation of speech units from a large speech database.
The selection of the units aims to cover as many units as possible in different phonetic and prosodic contexts in order to provide the necessary variability in the synthetic speech output.

Summary

Introduction

A text-to-speech (TTS) system [1,2,3,4,5,6] converts a text input into a speech output and is applied to smart human-computer interfaces and auxiliary speech systems for the visually impaired. In the history of TTS technology development, the waveform-based synthesis units approach [11,12,13,14,15,16,17,18] is one of the most commonly used technologies. The selection of the units aims to cover as many units as possible in different phonetic and prosodic contexts in order to provide the necessary variability in the synthetic speech output. This approach requires a great number of speech units; that is, a large amount of storage space is needed to reach superior speech quality.

Methods

Results

Conclusion