Pitch-synchronous frame-by-frame and segment-based articulatory analysis by synthesis

Sunil K Gupta, Juergen Schroeter

doi:10.1121/1.407364

Abstract

This paper presents a pitch-synchronous analysis-by-synthesis procedure for estimating model parameters for voiced speech. These model parameters describe the vocal-tract shape and the time derivative of the glottal area function. The excitation waveform is derived from the glottal area function by incorporating source-tract interaction using the current vocal-tract input impedance. The corresponding analysis procedure for estimating the model parameters once every pitch period is outlined. A significant improvement in quality was obtained for the new pitch-synchronous analysis/synthesis procedure relative to the fixed-frame-length-based scheme used previously. It was also found that the new pitch-synchronous articulatory analysis/synthesis scheme achieves lower rms spectral distortion values than the 2.4 kb/s Federal Standard LPC-10E algorithm. A segment-based procedure for estimating the vocal-tract model parameters at a rate much lower than the current pitch is described. In this segment-based analysis-by-synthesis approach, the model parameters are estimated every 50–100 ms. The parameters for the intermediate pitch periods are derived by interpolation. The segments are selected using a maximum likelihood segmentation algorithm that segments an utterance into diphonelike units. A segment-based parameter optimization scheme could lead to a highly economical representation of the speech signal for potential applications in very low bit rate speech coding and speech storage. The above schemes were optimized for a pilot test sentence and then evaluated using eight test sentences for both a log-area and the Coker articulatory model representation of the vocal tract. Nine listeners were asked to judge the quality of the synthesis in a paired-comparison test, and the results were analyzed using a nonparametric one-tailed sign test.
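As an illustration of the one-tailed sign test mentioned above, the following minimal sketch computes an exact p-value from paired-comparison preference counts. The function name `sign_test_one_tailed` and the counts in the usage lines are hypothetical and are not taken from the paper; ties are assumed to have been discarded before counting.

```python
from math import comb


def sign_test_one_tailed(wins: int, losses: int) -> float:
    """Exact one-tailed sign test p-value.

    Under the null hypothesis of no preference, the number of "wins"
    follows Binomial(n, 0.5) with n = wins + losses (ties excluded).
    Returns P(X >= wins).
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no informative comparisons, nothing to reject
    # Sum the upper tail of the binomial distribution with p = 0.5.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Hypothetical counts (NOT results from the paper): 8 of 9 judgments
# favor one condition over the other.
p_value = sign_test_one_tailed(wins=8, losses=1)
print(f"one-tailed p = {p_value:.4f}")
```

A small p-value would indicate that the preference for one condition is unlikely under the no-preference hypothesis; the significance threshold used in the paper is not stated in the abstract.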
For the log-area representation of the vocal tract, we found a significant degradation in speech quality for the segment-based optimization procedure relative to the frame-based procedure. However, for the Coker model representation, the degradation was found to be insignificant. This shows that unlike cross-sectional areas, the movement of various articulators in the vocal tract during speech production can be described with sufficient accuracy by specifying the position of these articulators and by using an interpolation function at time intervals much longer than a pitch period.
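The idea of specifying articulator positions at sparse segment boundaries and filling in the intermediate pitch periods by interpolation can be sketched as follows. This is a minimal sketch that assumes plain linear interpolation; the abstract does not specify the interpolation function actually used, and the function name `interpolate_parameters`, the two-parameter vectors, and the time values are all hypothetical.

```python
from bisect import bisect_right


def interpolate_parameters(anchor_times, anchor_params, pitch_marks):
    """Linearly interpolate parameter vectors at each pitch mark.

    anchor_times  -- strictly increasing segment-boundary times (seconds)
    anchor_params -- one parameter vector per anchor time
    pitch_marks   -- times at which per-pitch-period parameters are needed
    Times outside the anchor range are clamped to the nearest anchor.
    """
    result = []
    for t in pitch_marks:
        if t <= anchor_times[0]:
            result.append(list(anchor_params[0]))
            continue
        if t >= anchor_times[-1]:
            result.append(list(anchor_params[-1]))
            continue
        # Index of the anchor at or just before t.
        i = bisect_right(anchor_times, t) - 1
        t0, t1 = anchor_times[i], anchor_times[i + 1]
        w = (t - t0) / (t1 - t0)  # fractional position within the segment
        result.append([
            (1.0 - w) * a + w * b
            for a, b in zip(anchor_params[i], anchor_params[i + 1])
        ])
    return result


# Hypothetical example: two anchors 100 ms apart, two parameters each,
# evaluated at pitch marks spaced 25 ms apart.
anchors = [0.0, 0.1]
params = [[0.0, 1.0], [1.0, 3.0]]
marks = [0.0, 0.025, 0.05, 0.075, 0.1]
for t, p in zip(marks, interpolate_parameters(anchors, params, marks)):
    print(f"t = {t:.3f} s -> {p}")
```

With anchors spaced 50–100 ms apart, only the anchor vectors need to be stored or transmitted, which is the source of the economy described in the abstract.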

Full Text