Abstract

Combining multiple estimators to obtain a more accurate final result is a well-known technique in statistics. In the domain of speech recognition, there are many ways in which this general principle can be applied. We have looked at several ways of combining the information from different feature representations, and used these results in the best-performing system in last year’s Aurora evaluation: our entry combined feature streams after the acoustic classification stage, then used a combination of neural networks and Gaussian mixtures for more accurate modeling. These and other approaches to combination are described and compared, and some more general questions arising from the combination of information streams are considered.

Introduction

Despite the successful deployment of speech recognition applications, there are circumstances that present severe challenges to current recognizers: for instance, background noise, reverberation, fast or slow speech, unusual accents, etc. In the huge body of published research there are many reports of success in mitigating individual problems, but fewer techniques that help across multiple different conditions. What is needed is a way to combine the strengths of several different approaches into a single system.

One thread of research at ICSI has been the development of novel representations of the speech signal to use as features for recognition. Often these are related to aspects of the auditory system, such as the short-term adaptation of RASTA (Hermansky & Morgan 1994) and the 2-16 Hz modulation frequency sensitivity of MSG (Kingsbury 1998). We typically find that each feature type has particular circumstances in which it excels, and this has motivated our investigations into methods for combining separate feature streams into a single speech recognition system.

A related question arises when comparing different basic recognition architectures.
ICSI has pioneered the “hybrid connectionist” approach to speech recognition, using neural networks in place of the conventional Gaussian mixture estimators as the recognizer’s acoustic model (Morgan & Bourlard 1995). Neural networks have many attractive properties, such as their discriminative ability to focus on the most critical regions of feature space, and their wide tolerance of correlated or non-Gaussian feature statistics. However, many tools and techniques have been developed for Gaussian-mixture-based systems that cannot easily be transferred to the connectionist approach. We were therefore also interested in techniques for combining architectures, to create a single system that could exploit the benefits of both the connectionist and Gaussian-mixture-based systems.

Combination techniques also offer various practical advantages. International collaborations are central to ICSI’s charter, and good mechanisms for combining relatively independent systems have made it possible for us to build single recognition systems that combine acoustic models trained at ICSI with those from our European collaborators. At a smaller scale, being able to construct systems from relatively independent pieces without having to retrain the entire assembly can significantly increase the overall complexity of the recognizers we can practically produce.

This paper briefly reviews some of the theoretical attractions of combination, then surveys several practical realizations of these advantages, many arising from projects conducted at ICSI. We conclude with a discussion of some of the outstanding research issues in combination-based speech recognition systems.

The justification for combinations

Combination is a well-known technique in statistics. If we have several different ways to obtain an estimate of the same underlying value, a better estimate can usually be obtained by combining them all, for example by averaging.
The key requirement for this to be beneficial is that the ‘noise’ in each estimator (i.e. the difference between the estimated and true values) should be uncorrelated. If this is true, for instance if the estimators are based on different measurement techniques subject to different sources of error, then on average the errors will tend to cancel more often than they reinforce, so an optimal combination will improve accuracy.

An example of this principle in action is the system of Billa et al. (1999). They combined three nearly identical sets of speech features, the only difference being that the analysis frame rate varied between 80 and 125 samples per second. Although all the feature sets used the same representation, the slight differences in how the signal was framed were enough to introduce some decorrelation between the errors in the streams, and the combined system performed significantly better than any of the component streams.

For the neural network models used at ICSI, another way to get different estimators from a single feature representation is to train the networks from different random starting conditions, and we have seen some small benefits from this approach (Janin et al. 1999). However, using a pair of networks based on the same features did not perform as well as training a single network of twice the size. In speech recognition, we can almost always get a marginal improvement simply by increasing the number of model parameters, so combination schemes need to offer some advantage (in terms of performance or practicalities) over this simplest approach.

As mentioned above, our experience in practice is that certain processing will perform particularly well on certain subsets of the data.
If we could figure out when this is the case, either because the estimator has some measure of how well it is doing, or because we have a separate classifier telling us which model is likely to work best, then we would expect to be able to make a more successful combination of the information. This “mixture of experts” approach has also been widely investigated, but for speech recognition it often proves as difficult to classify the data into different domains of expertise as to make the overall classification into speech sounds (Morris 2000). However, a combination system that somehow tends to de-emphasize the poorly-performing models will be preferable to unweighted averaging.

Figure 1 illustrates some of the ways in which two feature streams might be combined in a recognizer. If we break speech recognition into feature calculation, acoustic classification and HMM decoding, the streams could be combined after any of these stages. Specific examples of each of these three possibilities are discussed below.

Feature combinations

Our first forays into feature combination came during the early development of MSG features (Kingsbury & Morgan 1997). A set of novel features gave a word error rate (WER) twice as large as the standard baseline features, yet when the two systems were combined by simply multiplying the posterior probability estimates for each phone, adding the weaker features yielded a 15% relative WER reduction over our baseline features.
Such “posterior combination” via multiplication remains the most successful combination scheme we have found at this level. A possible explanation for this success is that it carries an element of the “mixture of experts” approach mentioned above: if a poorly-performing classifier becomes ‘equivocal’, so that all possible classes are given roughly equal probabilities, then it will have little or no net effect when combined via multiplication with a more confident model; the weaker model will be discounted by the combination.

We have since used posterior combination in a variety of situations, including the 1998 DARPA/NIST Broadcast News evaluation system from the SPRACH project, a collaboration with Cambridge and Sheffield universities (Cook et al. 1999). Most recently, we applied the approach to the 1999 Aurora task (Pearce 1998); this was one of the key elements in the best-performing system in that evaluation, a collaboration between ICSI, the Oregon Graduate Institute, and Qualcomm (Sharma et al. 2000).

Our results in this task were obtained via posterior combination (PC), i.e. using separate acoustic classifier models for each feature stream and combining the posterior probability outputs. For completeness, we made extensive comparisons with the simpler approach of feature combination (FC), i.e. merging the two feature streams and classifying them jointly within a single, larger model. We found a complex pattern of results, with the best approach depending on which feature streams were to be combined (Ellis 2000); some of the results are shown in Table 1. We argued that feature combination was better suited to feature streams that showed informative co-dependence, and posterior combination was more appropriate when the feature spaces were closer to independent. (In fact, posterior combination by multiplication is very close to the optimal strategy for two streams that are conditionally mutually independent.)
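
As an illustration of posterior combination by multiplication, the following minimal Python sketch multiplies the per-class posteriors from two streams for a single frame and renormalizes the result. It is not the code used in the systems described here: the function name combine_posteriors, the floor eps, and the three-class example vectors are all assumptions made for the example. The sketch also shows the discounting effect discussed above, in which an equivocal (near-uniform) stream leaves a confident stream's posteriors essentially unchanged.

```python
import math


def combine_posteriors(stream_a, stream_b, eps=1e-12):
    """Multiply per-class posteriors from two streams and renormalize.

    stream_a and stream_b are hypothetical posterior vectors for one frame,
    one entry per class (e.g. per phone), each summing to 1.
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("both streams must cover the same classes")
    # Adding log-probabilities is equivalent to multiplying probabilities;
    # the floor eps (an assumption of this sketch) avoids log(0).
    log_scores = [
        math.log(max(a, eps)) + math.log(max(b, eps))
        for a, b in zip(stream_a, stream_b)
    ]
    # Subtract the largest log score before exponentiating, for stability.
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative three-class posteriors (made-up numbers, not measured data).
confident = [0.7, 0.2, 0.1]
equivocal = [1 / 3, 1 / 3, 1 / 3]
competing = [0.2, 0.7, 0.1]

# A uniform stream contributes the same factor to every class, so it
# cancels in the renormalization and the confident stream dominates.
unchanged = combine_posteriors(confident, equivocal)

# Two confident streams that disagree end up sharing the probability mass.
tied = combine_posteriors(confident, competing)
```

For two streams that are exactly conditionally independent given the class, the product of posteriors would additionally be divided by the class prior before renormalizing. The sketch omits that term and implements only the plain product described in the text.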
Figure 1: Alternative combination strategies for a recognition system based on two feature streams. The information may be combined after feature calculation, after classification, after decoding, or by some combination of these strategies. (Diagram blocks: input sound; feature 1 and feature 2 calculation; acoustic classifiers 1 and 2; HMM decoders 1 and 2. Combination points: feature combination on the features, posterior combination on the phone probabilities, hypothesis combination on the word sequences.)
