Abstract

The performance of current speaker-independent speech recognition technology is limited by the inadequacy of the measures of the speech data to discriminate between different speech sounds. In particular, two critical assumptions that underlie and limit most current recognition techniques are that: 1) speech data from different frames are statistically independent (i.e., there are no between-frame interactions); and 2) speech data statistics are independent of phonetic events (i.e., distance measures are fixed and independent of input or reference speech). In the context of speaker-independent isolated digit recognition, improved recognition performance is demonstrated by: 1) explicitly modeling the correlation between spectral measurements of adjacent frames; and 2) using a distance measure which is a function of the recognition reference frame being used. A statistical model was created from a 2464-token database (2 tokens of each of 11 words, "zero" through "nine" and "oh," from each of 112 speakers). Primary features include energy and filter bank amplitudes. Interspeaker variability was estimated by time-aligning all training tokens and creating an ensemble of 224 feature vectors for each reference frame. A separate normal distribution was then estimated for each reference frame, jointly with its neighbors. Testing used maximum-likelihood decision methods on a multidialect database of 2486 spoken digit tokens collected from 113 different speakers. The substitution rate dropped from 1.7 to 1.4 percent with incorporation of between-frame statistics, and further to 0.6 percent with incorporation of frame-specific statistics in the likelihood model.
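The two ideas in the abstract, joint statistics over adjacent frames and a separate distribution per reference frame, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the pairing of each frame with only its successor, the ridge term, and the assumption that every token has already been time-aligned to the same number of reference frames are all illustrative choices. Summing the per-pair log-likelihoods treats overlapping frame pairs as if they were independent, which is a simplification.

```python
import numpy as np

def stack_adjacent(frames):
    """Join each frame with its successor: (T, D) -> (T-1, 2D)."""
    return np.concatenate([frames[:-1], frames[1:]], axis=1)

def fit_frame_gaussians(aligned_tokens, ridge=1e-3):
    """Estimate one Gaussian per reference frame pair.

    aligned_tokens: array (N, T, D) of N training tokens, assumed to be
    already time-aligned to T reference frames with D features each.
    Returns means (T-1, 2D) and covariances (T-1, 2D, 2D).
    """
    stacked = np.stack([stack_adjacent(tok) for tok in aligned_tokens])
    means = stacked.mean(axis=0)
    centered = stacked - means
    # Frame-specific covariance: one matrix per reference frame pair.
    covs = np.einsum("nti,ntj->tij", centered, centered) / stacked.shape[0]
    # Hypothetical ridge term to keep every covariance invertible.
    covs = covs + ridge * np.eye(stacked.shape[2])
    return means, covs

def log_likelihood(token, means, covs):
    """Sum of Gaussian log-densities over all adjacent-frame pairs."""
    diff = stack_adjacent(token) - means                      # (T-1, 2D)
    _, logdet = np.linalg.slogdet(covs)                       # (T-1,)
    solved = np.linalg.solve(covs, diff[..., None])[..., 0]   # (T-1, 2D)
    maha = np.einsum("ti,ti->t", diff, solved)
    k = diff.shape[1]
    return float(-0.5 * np.sum(maha + logdet + k * np.log(2.0 * np.pi)))

def classify(token, models):
    """Maximum-likelihood decision over a dict {word: (means, covs)}."""
    return max(models, key=lambda word: log_likelihood(token, *models[word]))
```

With one `(means, covs)` pair fitted per vocabulary word, `classify(token, models)` returns the word whose frame-specific model assigns the aligned token the highest score. Replacing each covariance with a single shared matrix would correspond to the fixed distance measure that the abstract argues against.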
