The Japanese lexicon is typically classified into at least three etymological strata: native, Sino-Japanese, and foreign words. In Tokyo Japanese, nouns from different strata are known to differ in both their phonotactic and their tonotactic properties. Should Tokyo Japanese nouns be analyzed with a non-clustering grammar, which generates all nouns from a single phonological grammar, or with a clustering grammar, which generates nouns from different strata using different grammars? In this study, I address this question from a probabilistic, model-selection perspective: the better probabilistic grammar is the one that better balances fit to the data against the number of parameters in the grammar. Using the UCLA Phonotactic Learner, I train two kinds of MaxEnt grammars, corresponding to the non-clustering and clustering analyses. I compare the two kinds of grammar using the Bayesian Information Criterion (BIC), and show that the clustering grammars make a better trade-off between fit to data and model size than the non-clustering grammars. Consequently, the different etymological strata of the Tokyo Japanese nominal lexicon are better analyzed as being generated by different MaxEnt grammars than by a single MaxEnt grammar.
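
The BIC comparison described above can be illustrated with a minimal sketch. It uses the standard definition BIC = k ln(n) - 2 ln(L), where lower values indicate a better trade-off. The sketch assumes that a clustering analysis is scored by summing the log-likelihoods and the parameter counts of its stratum-specific grammars; that assumption, the function names, and all numbers below are hypothetical placeholders for illustration, not the study's procedure or results.

```python
import math

def bic(log_likelihood: float, num_params: int, num_obs: int) -> float:
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood

def clustering_bic(strata, num_obs: int) -> float:
    """BIC for a set of stratum-specific grammars.

    `strata` is a list of (log_likelihood, num_params) pairs, one per
    stratum. Assumption: the combined model's log-likelihood and parameter
    count are the sums over strata.
    """
    total_ll = sum(ll for ll, _ in strata)
    total_k = sum(k for _, k in strata)
    return bic(total_ll, total_k, num_obs)

# Hypothetical numbers, chosen only to show the mechanics.
n_nouns = 1000
single = bic(log_likelihood=-5000.0, num_params=40, num_obs=n_nouns)
clustered = clustering_bic(
    [(-1500.0, 30), (-1400.0, 30), (-1600.0, 30)],  # three strata
    num_obs=n_nouns,
)

better = "clustering" if clustered < single else "non-clustering"
print(f"non-clustering BIC: {single:.1f}")
print(f"clustering BIC:     {clustered:.1f}")
print(f"preferred analysis: {better}")
```

In this toy setting the clustering analysis uses more than twice as many parameters, but its gain in log-likelihood outweighs the ln(n) penalty on the extra parameters, so its BIC is lower.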