Abstract

The first step in building an ASR system is to extract proper speech features. Ideal speech features for ASR must discriminate well between linguistic units and be robust to non-linguistic factors such as gender, age, emotion, and noise. The discriminability of various features has been compared in several Zerospeech challenges, which aim to discover linguistic units without any transcriptions; there, the posteriorgrams produced by DPGMM clustering show strong discriminability and achieve several of the top ABX phoneme-discrimination scores. This paper appends DPGMM posteriorgrams to acoustic features to increase their discriminability and thereby enhance ASR systems. To the best of our knowledge, DPGMM features, which are usually applied to tasks such as spoken term detection and zero-resource tasks, have not previously been applied to large vocabulary continuous speech recognition (LVCSR). DPGMM clustering dynamically adjusts the number of Gaussians until each Gaussian fits one segmental pattern of the whole speech corpus with the highest probability, so that linguistic units with different segmental patterns are clearly discriminated. Our experimental results on the WSJ corpora show that our proposal consistently improves ASR systems and yields even larger gains on smaller datasets with fewer resources.
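The core idea of the abstract, appending per-frame DPGMM posteriorgrams to the acoustic features, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (the paper presumably uses a sampling-based DPGMM clusterer); it substitutes scikit-learn's truncated variational Dirichlet-process GMM (`BayesianGaussianMixture`) as a stand-in, and the function name, the `max_components` truncation level, and the use of MFCCs as the base features are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def append_dpgmm_posteriorgram(feats, max_components=100, seed=0):
    """Cluster frames with a (truncated) Dirichlet-process GMM and append
    the per-frame posteriorgram to the original acoustic features.

    feats: array of shape (num_frames, num_dims), e.g. MFCCs.
    Returns an array of shape (num_frames, num_dims + max_components).
    """
    # Variational DP-GMM: the truncation level is max_components, but the
    # effective number of used Gaussians is inferred from the data.
    dpgmm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",
        max_iter=200,
        random_state=seed,
    )
    dpgmm.fit(feats)

    # Posteriorgram: per-frame posterior probabilities over the Gaussians.
    posteriorgram = dpgmm.predict_proba(feats)

    # Augmented features fed to the ASR front end.
    return np.hstack([feats, posteriorgram])
```

In this sketch the augmented features would simply replace the plain acoustic features at the input of the acoustic model; how the posteriorgrams are normalized or transformed before concatenation is a design choice not specified here.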
