Introduction to the special issue on statistical language modeling

Jianfeng Gao, Chin-Yew Lin

doi:10.1145/1034780.1034781

Abstract

The goal of statistical language modeling (SLM) is to estimate the likelihood (or probability) of a word string. SLM is fundamental to many natural language applications, such as automatic speech recognition (ASR) [Jelinek 1990], statistical machine translation (SMT) [Brown et al. 1993], and Asian language text input [Gao et al. 2002a]. Research on SLM involves two main tasks: modeling and estimation. Modeling determines the structure of a statistical model; estimation determines the free parameters of the model from training data. SLM usually uses a parametric model with Maximum Likelihood Estimation (MLE) and various smoothing methods to tackle data sparseness. Many statistical models have been proposed in the past, but n-gram models (in particular, bigram and trigram models) still dominate SLM research. SLM has recently been shown to be an effective framework for several new applications, such as question answering [Berger 2001], text summarization, paraphrasing [Barzilay and Lee 2004], and information retrieval [Croft and Lafferty 2003]. However, these new applications come with new challenges. For example, in SLM approaches to information retrieval, a language model has to be trained on a single document, an extremely small training set, whereas in ASR a language model is typically trained on a million-word corpus. Recent developments in related techniques have stimulated new modeling and estimation methods that go beyond the scope of the traditional approaches. Two representative examples of such techniques are statistical parsing and discriminative training. With the ever-increasing popularity of SLM, we think that it is the right time to assemble a special issue reflecting recent advances in both its theory and applications.
