Statistical properties of English text produced by Korean and Chinese authors

Robert Nelson

doi:10.1558/jrds/7207888858

Abstract

This article compares the Zipf’s law and Heaps’ law properties of English texts produced by second language (L2) writers of English with those of similar texts produced by native writers. Zipf’s law is a well-known statistical law describing the frequency distribution of words in texts, while Heaps’ law describes the rate at which new words are encountered as one reads a text. The analysis of the Zipf properties of texts by writers whose first language (L1) is Korean shows that the distribution of words and multiword sequences (n-grams) differs from that observed in native speakers’ texts. These differences imply that Korean writers do not fully exploit English pronominals. Analyses also indicate that texts produced by L1 and L2 writers show different rates of lexical innovation, as measured by their Heaps’ law properties. A number of studies have estimated the Zipf’s law and Heaps’ law properties of native language texts in various contexts (e.g., Li, 1992; Cancho, 2005), while others have considered the consequences of Zipf’s law for second language acquisition (Laufer and Nation, 1995; Meara, 2005; Ellis, 2012). This study is the first attempt to estimate the parameters of these laws for texts produced by different populations of second language writers.
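
As a rough illustration of the two laws named in the abstract, the sketch below estimates a Zipf exponent (how quickly word frequency falls with frequency rank) and a Heaps exponent (how quickly vocabulary grows with text length) from a list of tokens. It is not the estimation procedure used in the article: the regex tokenizer, the function names, and the choice of ordinary least squares on log-log data are all assumptions made for illustration, and more careful estimators exist.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; this simple regex is an illustrative assumption.
    return re.findall(r"[a-z']+", text.lower())


def loglog_fit(xs, ys):
    # Ordinary least-squares fit of log(y) against log(x).
    # Returns (slope, intercept); needs at least two distinct x values.
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def zipf_exponent(tokens):
    # Zipf's law: frequency f(r) is roughly C * r**(-alpha),
    # where r is the rank of a word when sorted by frequency.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    slope, _ = loglog_fit(ranks, freqs)
    return -slope


def heaps_exponent(tokens):
    # Heaps' law: vocabulary size V(n) is roughly K * n**beta
    # after reading the first n tokens.
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    positions = range(1, len(tokens) + 1)
    slope, _ = loglog_fit(positions, sizes)
    return slope
```

Under these assumptions, two corpora could be compared by calling `zipf_exponent(tokenize(text))` and `heaps_exponent(tokenize(text))` on each and comparing the resulting exponents. The same functions apply to n-grams if each n-gram is treated as a single token.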

Full Text