Text-directed speech enhancement employing phone class parsing and feature map constrained vector quantization

John H.L Hansen,Bryan L Pellom

doi:10.1016/s0167-6393(97)00003-4

Abstract

There are many situations where non-real-time speech enhancement is required. For such applications, employing any available a priori knowledge can lead to more effective enhancement solutions. In this study, a novel text-directed speech enhancement algorithm is developed for usage in non-real-time applications. In our approach, the text of the intended dialogue is used to partition noisy speech into regions of broad phoneme classifications. Classes considered include stops, fricatives, affricates, nasals, vowels, semivowels, diphthongs and silence. These partitions are then used to direct a new vector quantizer based enhancement scheme in which phone-class directed constraints are applied to improve speech quality. The proposed algorithm is evaluated using both objective as well as subjective quality assessment techniques. It is shown that the text-directed approach improves the quality of the degraded speech over a broad range of noise sources (i.e., flat communications channel noise, aircraft cockpit noise, helicopter fly-by noise, and automobile highway noise) and over a broad range of signal-to-noise ratios (i.e., 10, 5, 0 and −5 dB). In each case, the proposed method is shown consistently to exhibit improved objective quality over linear and generalized spectral subtraction, as well as the Auto-LSP constrained iterative enhancement method using the Itakura-Saito measure and a 100-sentence evaluation speech corpus. Subjective quality assessment was conducted in the form of an A-B comparison test. Results of these evaluations demonstrate that, for wideband noise distortions, the proposed algorithm is preferred over the unprocessed noisy speech more than 2 to 1, while the proposed algorithm is preferred over spectral subtraction by more than 3 to 1.

Full Text