Abstract

Our goal is to develop robust language models for speech recognition. These models must predict a word given its history. Despite the increasing size of electronic text corpora, not all possible word sequences of a language can be observed. One way to generalise to these unseen word sequences is to map words into classes. Class-based language models cover the language better with a reduced number of parameters, a situation which also helps speed up speech recognition systems. Two types of automatic word classification are described. They are trained on word statistics estimated from texts drawn from newspapers and transcribed speech. These classifications require no tagging: words are classified according to the local context in which they occur. The first maps the vocabulary words into a fixed number of classes according to a Kullback-Leibler measure. In the second, similar words are clustered into classes whose number is not fixed in advance. This work was carried out with French training data from two domains, which differ in both size and vocabulary.
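The class-based modelling the abstract describes conventionally factorises a bigram probability as P(w | h) ≈ P(c(w) | c(h)) · P(w | c(w)): a class-transition term times a within-class emission term. Below is a minimal sketch of that factorisation, under assumptions not in the abstract: the toy corpus and the hand-written `word2class` mapping are illustrative only, whereas the paper derives classes automatically from context statistics.

```python
from collections import Counter, defaultdict

# Illustrative toy corpus and a hand-made word-to-class map (an assumption;
# the paper's classes are induced automatically, not assigned by hand).
corpus = "the cat sat the dog sat the cat ran".split()
word2class = {"the": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}

# Map the corpus to its class sequence and count class bigrams.
class_seq = [word2class[w] for w in corpus]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
# Denominator: how often each class starts a bigram (last token excluded).
class_starts = Counter(class_seq[:-1])

# Word emission counts: how often each word occurs, and per-class totals.
word_counts = Counter(corpus)
class_totals = defaultdict(int)
for w, n in word_counts.items():
    class_totals[word2class[w]] += n

def class_bigram_prob(prev_word, word):
    """P(w | prev) ~= P(c(w) | c(prev)) * P(w | c(w)),
    the standard class-based bigram factorisation."""
    cp, cw = word2class[prev_word], word2class[word]
    p_class = class_bigrams[(cp, cw)] / class_starts[cp]  # class transition
    p_word = word_counts[word] / class_totals[cw]         # emission in class
    return p_class * p_word
```

Because unseen word pairs can still share observed class pairs, the model assigns them non-zero probability with far fewer parameters than a word-level bigram, which is the coverage benefit the abstract refers to.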
