Abstract

As multilingual products and technology grow in importance, the Linguistic Data Consortium (LDC) intends to provide the resources needed for research and development activities, especially in telephone-based, small-vocabulary recognition applications; language identification research; and large vocabulary continuous speech recognition research. The POLYPHONE corpora, a multilingual database of databases, are specifically designed to meet the needs of telephone application development and testing. Data sets from many of the world's commercially important languages will be available within the next few years. Language identification corpora will be large sets of spontaneous telephone speech in several languages with a wide variety of speakers, channels, and handsets. One corpus is now available, and current plans call for corpora of increasing size and complexity over the next few years. Large vocabulary speech recognition requires transcribed speech, pronouncing dictionaries, and language models. To fill this need, LDC will use the unattended computer-controlled collection methods developed for SWITCHBOARD to create several similar corpora, each about one-tenth the size of SWITCHBOARD, in other languages. Text corpora sufficient to create useful language models will be collected and distributed as well. Finally, pronouncing dictionaries covering the vocabulary of both transcripts and texts will be produced and made available.
