Abstract

A common problem with isolated word recognition systems arises when an untrained user speaks an unwanted word outside the active vocabulary. Such a word will be recognised as one of the keywords and may steer the dialogue in the wrong direction. The use of garbage or sink models (SMs) is a known technique to prevent these extraneous words from being recognised as vocabulary words. Each lexical word from the active vocabulary is represented in the recognition process by at least one word model (WM). A single SM is intended to be a general description of a wide range of lexical items, namely all those that do not belong to the limited active vocabulary. Our previous work [3] indicated that multiple SMs can improve the rejection score compared with a single SM in the context of a Continuous Hidden Markov Model (CHMM) recogniser with a single observation component. This improvement is related to the vocabulary size: for very small vocabularies there is no advantage in using more than one SM, whereas for larger vocabularies better results can be achieved with multiple models. When searching for the optimal number of SMs, an upper bound seems to be imposed by the available amount of speech training material. This amount is particularly relevant for training sink models, as they are intended to represent the whole word universe minus the small keyword vocabulary set. The parametric description provided by a single Gaussian distribution is known to be a poor model for the observation probability density function (pdf); however, given the restricted amount of speech training material, using Gaussian mixtures to describe the observation pdfs did not improve our results. In the present work we compare the performance of continuous and semi-continuous HMM (SCHMM) recognisers on the word rejection problem. The latter type of recogniser has advantages over the former both when training material is limited, which is indeed one of the critical factors in this study, and in terms of computational complexity. The semi-continuous approach combines a common set of pdfs, held in a codebook, with the word or sub-word models themselves. The codebook and the models can be easily initialised and re-estimated separately, using different sets of training material, or mutually optimised using the unified modelling approach described in [1]. Separate software tools were developed for each stage of training and testing, providing a complete SCHMM recognition platform. Some effort was also devoted to determining how best to combine the initialisation steps. The tests reported here allow us to compare CHMM and SCHMM recognisers when multiple SMs are used. Another issue addressed is the type and amount of speech material to be used for SM training. HMM clustering techniques for selecting the speech material used to train each sink model, in the context of multiple sink modelling, are discussed in [2].
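As a purely illustrative sketch of the codebook sharing described above, using the standard SCHMM formulation rather than notation taken from this paper: each state j scores an observation vector o_t through a mixture over a codebook of M Gaussian pdfs shared by all word and sink models,

    b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{o}_t;\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m),

where only the mixture weights c_{jm} are state-specific, while the means \mu_m and covariances \Sigma_m belong to the shared codebook. This sharing is what allows the codebook and the models to be initialised and re-estimated separately, and it reduces both the number of parameters to be trained and the number of Gaussian evaluations per frame relative to a CHMM that maintains a full set of densities per state.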
