Abstract

A statistical model of language is described and shown to be surprisingly successful in two experiments based on a statistical analysis of two text corpora. One experiment trained the model on the domain-specific VODIS corpus of 70,000 words, while the other trained it on the Brown corpus of 1 million words, containing text from a wide range of domains. In each experiment the model was tested using unseen phrases from the appropriate corpus and results show that a statistical model can be remarkably successful, even though there is no knowledge of syntax included in the model. Our results also show that the model is most effective when trained and tested on the domain-specific VODIS corpus, in spite of its small size. It is noted that the VODIS corpus is a great deal smaller than the total amount of language heard by a child in its first few years of life, which suggests that in the restricted domain of interest to a child there is more than sufficient sample language to build a successful statistical model...
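The abstract does not specify the form of the model, only that it is statistical and uses no knowledge of syntax. As a purely illustrative sketch, the Python fragment below assumes a simple word-bigram model of that kind: it is trained by counting adjacent word pairs in a corpus and then scores an unseen phrase by the product of bigram relative frequencies. The function names and the ad-hoc `floor` value for unseen pairs are assumptions for illustration, not the paper's actual method.

    from collections import defaultdict

    def train_bigram_model(tokens):
        """Count adjacent word pairs in a training corpus."""
        counts = defaultdict(lambda: defaultdict(int))
        totals = defaultdict(int)
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
            totals[prev] += 1
        return counts, totals

    def phrase_score(words, counts, totals, floor=1e-6):
        """Score a phrase as a product of bigram relative frequencies.

        `floor` is an ad-hoc stand-in for unseen word pairs; the abstract
        does not say how the paper handles them.
        """
        score = 1.0
        for prev, word in zip(words, words[1:]):
            rel = counts[prev][word] / totals[prev] if totals[prev] else 0.0
            score *= rel if rel > 0 else floor
        return score

    # Toy usage: train on a tiny "corpus", then score a held-out phrase.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    counts, totals = train_bigram_model(corpus)
    print(phrase_score("the dog sat on the mat".split(), counts, totals))

In the experiments described, training would use the VODIS or Brown text and testing would use phrases withheld from training; everything in the sketch beyond that train/test split is assumption.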
