Abstract

Our goal is to develop robust language models for speech recognition. These models must predict a word given its history. Despite the increasing size of electronic text corpora, not all possible word sequences of a language can be observed. One way to generalise to these unseen word sequences is to map words into classes. Class-based language models cover the language better with a reduced number of parameters, a situation which also helps speed up speech recognition systems. Two types of automatic word classification are described. They are trained on word statistics estimated from texts drawn from newspapers and transcribed speech. These classifications require no tagging: words are classified according to the local context in which they occur. The first maps the vocabulary words into a fixed number of classes according to a Kullback-Leibler measure. In the second, similar words are clustered into classes whose number is not fixed in advance. This work was carried out with French training data from two domains, which differ in both size and vocabulary.
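The class-based modelling the abstract describes conventionally factorises a bigram probability as P(w | h) ≈ P(c(w) | c(h)) · P(w | c(w)): a class-transition term times a within-class emission term. Below is a minimal sketch of that factorisation, under assumptions not in the abstract: the toy corpus and the hand-written `word2class` mapping are illustrative only, whereas the paper derives classes automatically from context statistics.

```python
from collections import Counter, defaultdict

# Illustrative toy corpus and a hand-made word-to-class map (an assumption;
# the paper's classes are induced automatically, not assigned by hand).
corpus = "the cat sat the dog sat the cat ran".split()
word2class = {"the": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}

# Map the corpus to its class sequence and count class bigrams.
class_seq = [word2class[w] for w in corpus]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
# Denominator: how often each class starts a bigram (last token excluded).
class_starts = Counter(class_seq[:-1])

# Word emission counts: how often each word occurs, and per-class totals.
word_counts = Counter(corpus)
class_totals = defaultdict(int)
for w, n in word_counts.items():
    class_totals[word2class[w]] += n

def class_bigram_prob(prev_word, word):
    """P(w | prev) ~= P(c(w) | c(prev)) * P(w | c(w)),
    the standard class-based bigram factorisation."""
    cp, cw = word2class[prev_word], word2class[word]
    p_class = class_bigrams[(cp, cw)] / class_starts[cp]  # class transition
    p_word = word_counts[word] / class_totals[cw]         # emission in class
    return p_class * p_word
```

Because unseen word pairs can still share observed class pairs, the model assigns them non-zero probability with far fewer parameters than a word-level bigram, which is the coverage benefit the abstract refers to.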
