The Relationship Between Word Length and Average Information Content in Japanese

Yuki Tanida

doi:10.1111/cogs.13302

Abstract

Piantadosi, Tily, and Gibson analyzed a large‐scale web‐scraped corpus (the Google 1T dataset) and reported that word length is independently predicted by average information content (surprisal) calculated with 2‐ to 4‐gram models (hereafter, longer‐span surprisal) across 11 Indo‐European languages: Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths stressed the importance of preprocessing in studies with large‐scale corpora and reanalyzed the same datasets. After their preprocessing, the results of Piantadosi et al. were not replicated in Czech, Romanian, and Swedish. Additionally, a German‐specific study by Koplenig, Kupietz, and Wolfer, which applied the preprocessing suggested by Meylan and Griffiths to a large‐scale but less noisy database, showed that a strict analysis did not replicate the result of Piantadosi et al. for that language. Together, these three studies provide the evidence relevant to this debate, covering 11 Indo‐European languages and one Afro‐Asiatic language, Hebrew. However, evidence from other language families is lacking. This study provides evidence from Japanese based on strict preprocessing of Google's web‐scraped database. The results show that Japanese word length can be predicted independently by 2‐ to 4‐gram surprisal.
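
To make the notion of average information content concrete, the following is a minimal sketch and not the procedure used in this study. It assumes a corpus that has already been segmented into word tokens, uses only the 2‐gram case with unsmoothed maximum‐likelihood probabilities, and averages each word's surprisal over all of its occurrences. The function name `average_surprisal` and the toy token sequence are hypothetical and serve illustration only.

```python
import math
from collections import Counter, defaultdict


def average_surprisal(tokens):
    """Return the mean bigram surprisal (in bits) of each word type.

    Assumption: probabilities are unsmoothed maximum-likelihood estimates,
    P(word | context) = count(context, word) / count(context),
    where the context is the single preceding token.
    """
    context_counts = Counter(tokens[:-1])              # count(context)
    bigram_counts = Counter(zip(tokens, tokens[1:]))   # count(context, word)

    surprisal_sums = defaultdict(float)  # summed surprisal per word type
    occurrences = Counter()              # occurrences that have a context

    for (context, word), n in bigram_counts.items():
        p = n / context_counts[context]
        surprisal_sums[word] += n * -math.log2(p)
        occurrences[word] += n

    return {w: surprisal_sums[w] / occurrences[w] for w in surprisal_sums}


# Hypothetical toy sequence standing in for a tokenized corpus.
toy_tokens = "a b a c a b".split()
scores = average_surprisal(toy_tokens)
```

In an analysis of the kind described above, such per‐word averages would then be related to word length, with longer contexts (3‐ and 4‐grams) handled analogously by conditioning on the two or three preceding tokens.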

Full Text