Abstract
Data sparseness is an inherent issue of statistical language models, and smoothing methods are usually used to resolve zero-count problems. In this paper, we empirically study and analyze the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on large Chinese corpora. Ten models are generated sequentially on corpora of various sizes, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, the Good-Turing and advanced Good-Turing smoothing methods are evaluated with inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the smoothing methods, which are useful for choosing an effective smoothing method to alleviate data sparseness for language models of various sizes. Finally, some helpful observations are described in detail.
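As a minimal sketch of the core idea behind Good-Turing smoothing (not the authors' implementation; function and variable names are illustrative): raw counts r are replaced by adjusted counts r* = (r + 1) · N_{r+1} / N_r, where N_r is the number of n-grams seen exactly r times, and the mass N_1 / N is reserved for unseen (zero-count) events.

```python
from collections import Counter

def good_turing_counts(counts):
    """Compute Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    `counts` maps n-grams to raw counts. Returns (adjusted, p_unseen),
    where p_unseen = N_1 / N is the total probability mass reserved
    for n-grams never seen in training.
    """
    freq_of_freq = Counter(counts.values())   # N_r: how many n-grams occur exactly r times
    total = sum(counts.values())              # N: total observed n-gram tokens
    adjusted = {}
    for gram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} = 0 (sparse high counts);
        # the "advanced" variants studied in the paper smooth this tail instead.
        adjusted[gram] = (r + 1) * n_r1 / n_r if n_r1 > 0 else r
    p_unseen = freq_of_freq.get(1, 0) / total
    return adjusted, p_unseen

# Toy unigram counts over Chinese words (illustrative data only).
counts = {"的": 3, "是": 2, "了": 2, "在": 1, "我": 1, "有": 1}
adj, p0 = good_turing_counts(counts)
```

Here N_1 = 3 of the 10 observed tokens are singletons, so 0.3 of the probability mass is set aside for unseen words, and each singleton's count is discounted from 1 to 2 · N_2 / N_1 = 4/3... per-type before renormalization.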
Highlights
Speech processing (SP) studies the domain of speech signals and the processing methods of these digital signals
We empirically studied and analyzed the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on large Chinese corpora
The Academia Sinica Balanced Corpus version 3.0 (ASBC) is composed of 9228 text files from different fields, occupying 118 MB and containing nearly 5 million Chinese words labeled with POS tags
Summary
Speech processing (SP) studies the domain of speech signals and the methods for processing these digital signals. It is often combined with natural language processing (NLP). Speech processing may be divided into two broad domains: speech recognition and speech synthesis. The former recognizes a speech signal to produce text output, while the latter synthesizes speech with appropriate prosody from input text or articles. In many domains of NLP, such as speech recognition [1] and machine translation [2], statistical language models (LMs) [3] play an important role.
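The experiments in the paper compare smoothing methods by perplexity on inside (training) and outside (held-out) test data. A minimal sketch of that metric, assuming per-token log probabilities are already available from a language model (the function name is illustrative):

```python
import math

def perplexity(log_probs):
    """Perplexity of a test sequence from per-token natural-log probabilities.

    PP = exp(-(1/N) * sum(log p(w_i | history))); lower is better, so a
    smoothing method that assigns more sensible probability to unseen
    n-grams yields lower perplexity on outside (held-out) test data.
    """
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check: a uniform model over a 4-word vocabulary assigns each
# token probability 0.25, giving perplexity exactly 4.
lp = [math.log(0.25)] * 10
```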