Multilingual speech databases at LDC

John J Godfrey

doi:10.3115/1075812.1075819

Abstract

As multilingual products and technology grow in importance, the Linguistic Data Consortium (LDC) intends to provide the resources needed for research and development activities, especially in telephone-based, small-vocabulary recognition applications; language identification research; and large vocabulary continuous speech recognition research.

The POLYPHONE corpora, a multilingual database of databases, are specifically designed to meet the needs of telephone application development and testing. Data sets from many of the world's commercially important languages will be available within the next few years.

Language identification corpora will be large sets of spontaneous telephone speech in several languages with a wide variety of speakers, channels, and handsets. One corpus is now available, and current plans call for corpora of increasing size and complexity over the next few years.

Large vocabulary speech recognition requires transcribed speech, pronouncing dictionaries, and language models. To fill this need, LDC will use the unattended computer-controlled collection methods developed for SWITCHBOARD to create several similar corpora, each about one-tenth the size of SWITCHBOARD, in other languages. Text corpora sufficient to create useful language models will be collected and distributed as well. Finally, pronouncing dictionaries covering the vocabulary of both transcripts and texts will be produced and made available.
