Frequency and Distribution of Korean Syllables in the Sejong Corpus

Eun-Ha Lee, Kichun Nam

doi:10.21296/jls.2020.3.92.79

Abstract

This study builds a database of Korean syllable frequencies and distributions as a resource for researchers in psycholinguistics and adjacent disciplines. To this end, we compiled a set of syllable token and type frequency lists, broken down by word class and by position within an eojeol or headword, from the Sejong Corpus, which contains 15 million eojeols of written text. Three main results emerged. First, the frequencies follow a power law: a small number of syllables account for most tokens and types. Second, the token and type frequencies of eojeol and headword syllables tend strongly to decrease as their phonological complexity increases. Third, the first and second syllables of eojeols and headwords differ substantially in their phonological and morphological properties. The database, which comprises 26 syllable frequency lists, is freely available from the GitHub repository of one of the authors.
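The distinction between token and type frequency by syllable position can be illustrated with a minimal sketch. This is not the authors' pipeline. The function name `syllable_frequencies`, the toy input, and the working definitions are assumptions made for illustration: an eojeol is treated as a space-delimited unit, token frequency counts every occurrence in running text, and type frequency counts each distinct eojeol once.

```python
from collections import Counter


def syllable_frequencies(eojeols, position=None):
    """Count Hangul syllables in a list of eojeols (hypothetical helper).

    position: 0-based, non-negative syllable index within the eojeol,
              or None to count syllables in every position.
    Returns (token_counts, type_counts).
    """

    def pick(eojeol):
        # Keep only precomposed Hangul syllables (U+AC00 to U+D7A3).
        syllables = [ch for ch in eojeol if "\uac00" <= ch <= "\ud7a3"]
        if position is None:
            return syllables
        return syllables[position:position + 1]

    # Token frequency: every eojeol occurrence contributes.
    token_counts = Counter()
    for eojeol in eojeols:
        token_counts.update(pick(eojeol))

    # Type frequency (assumed definition): each distinct eojeol contributes once.
    type_counts = Counter()
    for eojeol in set(eojeols):
        type_counts.update(pick(eojeol))

    return token_counts, type_counts


# Toy data, not taken from the Sejong Corpus.
sample = ["학교에", "학교에", "학생이", "교실"]
tokens, types = syllable_frequencies(sample)
first_tokens, first_types = syllable_frequencies(sample, position=0)
```

With this toy input, the syllable 학 has a token count of 3 but a type count of 2, because the eojeol 학교에 occurs twice and is counted once as a type. Restricting to `position=0` gives the first-syllable lists, which is the kind of positional breakdown the abstract describes.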

Full Text