Most speech coding algorithms operating at rates around 8 kb/s attempt to reproduce the original speech waveform. They reproduce the waveform efficiently by using models that exploit knowledge of how the speech signal is generated. In contrast, most coders operating at rates around 2.4 kb/s are completely parametric, usually transmitting parameters describing the pitch and the spectral envelope at regular intervals. However, because of model inadequacies, the reconstruction quality of current parametric methods never reaches that of the original signal, even at high bit rates. In this paper, a new method positioned between the waveform coders and the parametric coders is presented. It is based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per interval of 20-30 ms. Figure 1 shows the smooth evolution of the shape of the pitch cycle, which is typical of voiced speech signals. We will show how such a signal can be reconstructed by interpolating prototype pitch cycles between the updates. The prototype waveform interpolation (PWI) method retains the natural quality typical of coders which encode the entire waveform, but requires a bit rate close to that of the parametric coders. We discuss PWI methods based on linear prediction (LP). In LP-based speech coders, the signal is reconstructed from knowledge of the predictor coefficients and a description of the excitation signal. Of the existing LP-based algorithms, the code-excited linear-prediction (CELP) algorithm [1] and the LP vocoder [2] are examples of waveform and parametric coders, respectively. In its simplest form, CELP describes the speech waveform by time-varying LP filter coefficients and a filter excitation consisting of a concatenation of scaled fixed-length vectors from a codebook.
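The interpolation idea can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: it crossfades linearly between two prototype pitch cycles of possibly different lengths after resampling them to a common length, whereas a full PWI coder would also interpolate the pitch period and typically operates on the LP excitation rather than the speech waveform. The function name and all parameters are hypothetical.

```python
import numpy as np

def interpolate_prototypes(p0, p1, n_cycles):
    """Illustrative sketch of prototype-cycle interpolation: reconstruct a
    voiced segment from two prototype pitch cycles p0 and p1 by generating
    n_cycles smoothly evolving intermediate cycles.

    Simplification: both prototypes are resampled to one common length, so
    the pitch period is held fixed; a real PWI coder interpolates the
    period as well.
    """
    # Resample both prototypes onto a common normalized-time grid.
    L = max(len(p0), len(p1))
    grid = np.linspace(0.0, 1.0, L, endpoint=False)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(p0), endpoint=False), p0)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(p1), endpoint=False), p1)

    out = []
    for k in range(n_cycles):
        w = k / max(n_cycles - 1, 1)       # crossfade weight, 0 -> 1
        out.append((1.0 - w) * a + w * b)  # smoothly evolving cycle shape
    return np.concatenate(out)
```

The smooth evolution of the cycle shape is what avoids both the buzziness of pulse excitation and the aperiodicity artifacts of waveform matching; the crossfade above is the simplest possible realization of that idea.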
To achieve high efficiency during voiced speech, most implementations include a long-term predictor [3], or adaptive codebook [4], to facilitate periodicity of the reconstructed signal. Despite recent improvements [5,6], inaccurate reproduction of the periodicity remains the main source of perceptual distortion in current CELP algorithms at rates below 6 kb/s. In LP-based vocoders the voiced speech signal is modeled by a single pulse per pitch cycle. Because of excessive periodicity, this often gives the reconstructed speech a buzzy character. Recent work has shown that speech quality can be improved significantly by adding more information about the evolving waveform shape. Using a cluster of pulses for each pitch cycle, with blockwise shape adaptation, in combination with a smoothly varying overall gain, produced good results [7]. Alternatively, good-quality voiced speech can be obtained at rates around 3 kb/s by careful placement of the single-pulse locations [8,9]. Although significantly improved over the LP-based vocoders, and similar in quality to 4.8 kb/s CELP, such single-pulse excited (SPE) speech coders still suffer from some buzziness. Both the CELP and the SPE methods attempt to reproduce the original waveform, using a (spectrally weighted) signal-to-noise ratio (SNR) of the reconstructed speech signal as the criterion for determining the excitation sequence. However, maintaining the periodicity of the original speech signal is important for its perceptual quality, and maximization of the SNR often leads to a nonoptimal degree of periodicity. Thus, it was found in both the CELP [6] and the SPE coders [9] that improved speech quality can be obtained by increasing the periodicity, despite an associated reduction in SNR.
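For concreteness, the waveform-matching criterion discussed above can be written as SNR = 10 log10(Σs² / Σ(s − ŝ)²). The sketch below computes this plain (unweighted) SNR; actual CELP and SPE coders apply a perceptual weighting filter before the error measurement, which is omitted here for brevity.

```python
import numpy as np

def snr_db(reference, reconstruction):
    """Unweighted waveform SNR in dB between the original signal and its
    reconstruction. Coders such as CELP maximize a spectrally weighted
    variant of this quantity when selecting the excitation sequence;
    the weighting filter is omitted in this sketch."""
    reference = np.asarray(reference, dtype=float)
    err = reference - np.asarray(reconstruction, dtype=float)
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(err**2))
```

The key observation in the text is that this criterion is not perfectly aligned with perception: an excitation choice that lowers the SNR slightly can still sound better if it makes the reconstruction more periodic.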