Abstract

Current word frequency norms used in speech research are based on written corpora. Recently, Brysbaert and New (2009) presented newer frequency counts based on a larger corpus of film and television subtitles, on the idea that these would more closely approximate actual spoken frequencies. Although Brysbaert and New (2009) showed that their frequencies predict visual reaction time data significantly better, those counts still came from corpora of text that was originally written and perhaps heavily edited. In general, most speech researchers prefer norms from large written corpora, because norms from spoken corpora are often based on a smaller number of tokens (i.e., fewer than one million). However, many studies use only a small subset of words, whose frequencies may not benefit from a large corpus. Furthermore, many studies dichotomize words into high- and low-frequency groups, rendering the fine distinctions between word frequencies computed from a large corpus potentially less useful. The first goal of the current project is to compute word frequency norms using only material from spoken corpora. The second goal is to compare how well norms from spoken corpora and norms based on popular written corpora predict performance in speech processing experiments. Our preliminary results indicate that for the vast majority of familiar words that are likely to be used in small research studies, any frequency norm, even one from a small spoken corpus, predicts an equivalent amount of variance in lexical decision data.
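
As a hedged illustration of the kind of computation the first goal implies, the sketch below derives per-million word frequencies (with a smoothed log transform) from a tokenized corpus and then applies the median split that abstracts like this one describe when dichotomizing words into high- and low-frequency groups. The function name, the toy corpus, and the log10(fpm + 1) smoothing are illustrative assumptions, not details taken from the study.

```python
from collections import Counter
import math
import statistics

def frequency_norms(tokens):
    """Raw counts, per-million frequencies, and smoothed log10 frequencies
    for each word type in a tokenized (assumed lowercased) corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    norms = {}
    for word, count in counts.items():
        per_million = count * 1_000_000 / total
        norms[word] = {
            "count": count,
            "per_million": per_million,
            # +1 smoothing keeps rare words from producing extreme log values
            "log10_fpm": math.log10(per_million + 1),
        }
    return norms

# Toy "spoken corpus" standing in for a transcript of conversational speech.
corpus = "yeah i mean i think the dog saw the dog again".split()
norms = frequency_norms(corpus)
print(norms["the"])

# Dichotomization as described in the abstract: a median split on
# per-million frequency yields high- vs. low-frequency word groups.
median_fpm = statistics.median(v["per_million"] for v in norms.values())
high_frequency = {w for w, v in norms.items() if v["per_million"] >= median_fpm}
low_frequency = set(norms) - high_frequency
```

Because such a split collapses the frequency scale into two bins, small differences in the underlying counts rarely move a word across the median, which is one reason a very large corpus may add little for studies of this design.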
