The demand for telecommunications applications of automatic speech recognition has exploded in recent years. This area seems a natural candidate for speech recognition systems, since it embraces a tremendous variety of applications that rely entirely on audio signals and serial interfaces. However, the telecommunications environment strains the capabilities of current technology, given its broad range of uncontrollable variables, from speaker characteristics to telephone handsets and line quality. Current recognition systems have attained impressive performance levels on relatively controlled tasks, such as speaker-independent continuous digit recognition on laboratory databases comprising a few hundred speakers [1-3]. To comprehend the additional challenges of the telecommunications environment, we must study the effects on recognition of handset and channel characteristics, speaker accent, speaking style, and lexicon, as well as the interactions among these factors. No small amount of data will suffice to model these conditions.
Concurrent with the explosion of telecommunications applications has been the introduction of powerful statistical modeling techniques, known as hidden Markov models (HMMs), to speech recognition [4,5]. These computationally intensive algorithms introduce a large number of degrees of freedom into the speech recognition problem and hence converge slowly. As a consequence, they require orders of magnitude more training data than the previous generation of deterministic techniques. Many databases collected in the mid-1980s, such as the DARPA Resource Management database [6] and the TIMIT Acoustic Phonetic database [7], were ambitious programs in their own right, but their limited coverage has consistently proven to underrepresent dimensions that are important to HMM recognition systems.
The Voice Across America (VAA) database being collected at Texas Instruments is designed to satisfy the data requirements of this next generation of speech recognition systems. Our goal is to collect data over standard long-distance telephone lines from 100,000 speakers representing a demographically and geographically balanced sample of the contiguous United States. This database will provide the foundation for a thorough investigation of factors affecting speaker-independent continuous speech recognition for American English. Similar projects are being planned for other countries, and will form the basis for research into recognition of Japanese, British English, and European languages. To date, we have completed two phases of the VAA project, for a total of 50,000 utterances from nearly 3700 speakers. This paper describes the methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database. Our initial results underscore the need for an extensive database: even 1088 speakers, a large database by traditional standards, are insufficient to represent adequately the many dimensions of interest. One of our purposes here is to share the insights we have gained into telephone-based data collection, in the belief that the VAA model is likely to become the standard method of collecting data over the tele-