We report on recent improvements to an HMM-based continuous speech recognition system being developed at AT&T Bell Laboratories. These advances, which include the incorporation of inter-word, context-dependent units and an improved feature analysis, lead to a recognition system that achieves 95% word accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word-pair grammar (with a perplexity of about 60). We show that incorporating inter-word units into training results in better acoustic models of word-juncture coarticulation and gives a 20% reduction in error rate. An improved set of spectral and log-energy features further reduces word error rate by about 30%. Since we use a continuous density HMM to characterize each subword unit, it is straightforward to add new features to the feature vector (initially a 24-element vector consisting of 12 cepstral and 12 delta cepstral coefficients). We investigate augmenting the feature vector with 12 second-difference (delta-delta) cepstral coefficients and with first-difference (delta) and second-difference (delta-delta) log energies, thereby giving a 38-element feature vector. Additional error rate reductions of 11% and 18% were achieved, respectively. With the improved acoustic modeling of subword units, the overall error rate reduction was over 42%. We also found that spectral vectors corresponding to the same speech unit behave differently statistically depending on whether they occur at word boundaries or within a word. The results suggest that intra-word and inter-word units should be modeled independently, even when they appear in the same context. Using a set of subword units that included separate variants for intra-word and inter-word context-dependent phones yielded an additional decrease of about 6–10% in word error rate.
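As an illustration of the feature augmentation described above, the following is a minimal sketch of how a 38-element vector could be assembled from 12 cepstral coefficients and a log-energy track. It assumes a simple central difference, (x[t+1] - x[t-1]) / 2, with edge frames repeated; the abstract does not specify the difference window actually used, and the names `difference` and `augment` are hypothetical.

```python
from typing import List

Frame = List[float]


def difference(frames: List[Frame]) -> List[Frame]:
    """Central first difference over time, repeating the edge frames.

    Assumption: a plain (x[t+1] - x[t-1]) / 2 difference; the source does
    not state which difference window was used.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def augment(cepstra: List[Frame], log_energy: List[float]) -> List[Frame]:
    """Build 38-element vectors per frame.

    Layout: 12 cepstra + 12 delta + 12 delta-delta cepstra
            + 1 delta log energy + 1 delta-delta log energy.
    The raw log energy itself is not included, matching the 38-element
    count given in the abstract.
    """
    d_cep = difference(cepstra)
    dd_cep = difference(d_cep)
    energy = [[e] for e in log_energy]  # treat energy as a 1-element frame
    d_e = difference(energy)
    dd_e = difference(d_e)
    return [
        c + dc + ddc + de + dde
        for c, dc, ddc, de, dde in zip(cepstra, d_cep, dd_cep, d_e, dd_e)
    ]


# Synthetic input for illustration only: 5 frames of 12 coefficients.
cepstra = [[float(t * (k + 1)) for k in range(12)] for t in range(5)]
log_energy = [float(t * t) for t in range(5)]
features = augment(cepstra, log_energy)
print(len(features), len(features[0]))  # 5 38
```

Because each subword unit is modeled with a continuous density HMM, extending the observation vector in this way only changes the dimensionality of the Gaussian mixture components, not the model structure.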