Abstract
Data sparseness is an inherent issue of statistical language models, and smoothing methods are usually used to resolve zero-count problems. In this paper, we empirically study and analyze the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on large Chinese corpora. Ten models are generated sequentially on corpora of various sizes, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, the Good-Turing and advanced Good-Turing smoothing methods are evaluated with inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the smoothing methods, which are useful for choosing an effective smoothing method to alleviate data sparseness for language models of various sizes. Finally, some helpful observations are described in detail.
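As a minimal sketch of the core idea behind Good-Turing smoothing (not the authors' implementation; function and variable names are illustrative): raw counts r are replaced by adjusted counts r* = (r + 1) · N_{r+1} / N_r, where N_r is the number of n-grams seen exactly r times, and the mass N_1 / N is reserved for unseen (zero-count) events.

```python
from collections import Counter

def good_turing_counts(counts):
    """Compute Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    `counts` maps n-grams to raw counts. Returns (adjusted, p_unseen),
    where p_unseen = N_1 / N is the total probability mass reserved
    for n-grams never seen in training.
    """
    freq_of_freq = Counter(counts.values())   # N_r: how many n-grams occur exactly r times
    total = sum(counts.values())              # N: total observed n-gram tokens
    adjusted = {}
    for gram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} = 0 (sparse high counts);
        # the "advanced" variants studied in the paper smooth this tail instead.
        adjusted[gram] = (r + 1) * n_r1 / n_r if n_r1 > 0 else r
    p_unseen = freq_of_freq.get(1, 0) / total
    return adjusted, p_unseen

# Toy unigram counts over Chinese words (illustrative data only).
counts = {"的": 3, "是": 2, "了": 2, "在": 1, "我": 1, "有": 1}
adj, p0 = good_turing_counts(counts)
```

Here N_1 = 3 of the 10 observed tokens are singletons, so 0.3 of the probability mass is set aside for unseen words, and each singleton's count is discounted from 1 to 2 · N_2 / N_1 = 4/3... per-type before renormalization.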
Highlights
Speech processing (SP) studies the domain of speech signals and the processing methods of these digital signals
We empirically studied and analyzed the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on large Chinese corpora
The Academia Sinica Balanced Corpus version 3.0 (ASBC) is composed of 9228 text files from different fields, occupying 118 MB and containing nearly 5 million Chinese words labeled with POS tags
Summary
Speech processing (SP) studies the domain of speech signals and the methods for processing these digital signals. It is often combined with natural language processing (NLP). Speech processing may be divided into two broad domains: speech recognition and speech synthesis. The former recognizes a speech signal to produce text output, while the latter synthesizes speech with appropriate prosody from input text or articles. In many domains of NLP, such as speech recognition [1] and machine translation [2], statistical language models (LMs) [3] play an important role.
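The experiments in the paper compare smoothing methods by perplexity on inside (training) and outside (held-out) test data. A minimal sketch of that metric, assuming per-token log probabilities are already available from a language model (the function name is illustrative):

```python
import math

def perplexity(log_probs):
    """Perplexity of a test sequence from per-token natural-log probabilities.

    PP = exp(-(1/N) * sum(log p(w_i | history))); lower is better, so a
    smoothing method that assigns more sensible probability to unseen
    n-grams yields lower perplexity on outside (held-out) test data.
    """
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check: a uniform model over a 4-word vocabulary assigns each
# token probability 0.25, giving perplexity exactly 4.
lp = [math.log(0.25)] * 10
```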