Abstract
Various methods have been developed for identifying keywords and key clusters. Most of these methods use a reference corpus to identify keywords or key clusters in the target corpus, although a few studies have identified them without a reference corpus. However, little research appears to have compared the effectiveness of these methods, especially when they are used to identify key clusters, a newer concept than keywords. To address this research gap, this study compares the accuracy and effectiveness of the following five methods in identifying key clusters in a corpus of Charles Dickens’s novels without the use of a reference corpus: TF (Term Frequency, a common frequency measure); DPnorm (Deviation of Proportions normalized, a robust and effective dispersion measure); PPMI (Positive Pointwise Mutual Information, a widely used association strength measure); TF-IDF (Term Frequency-Inverse Document Frequency, a blended method that considers both term frequency and inverse document frequency); and TF-DPnorm (Term Frequency-DP normalized, a self-developed blended method that factors in both frequency and normalized dispersion). With the top key clusters that Mahlberg (2007) identified in the same corpus of Dickens’s novels as the benchmark, the results show that, of the five methods, the self-developed TF-DPnorm method and the TF method are the most accurate and effective in identifying key clusters in literary texts when no reference corpus is used. Reasons for the differences across the methods are explored, and research implications are discussed.
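As a rough illustration of two of the measures named above, the following Python sketch computes TF and DPnorm for word clusters without any reference corpus. It rests on several assumptions that the abstract does not confirm: each novel is treated as one corpus part, clusters are contiguous n-word sequences, and DPnorm follows the commonly cited definition (DP is half the summed absolute difference between observed and expected proportions, divided by one minus the smallest expected proportion). The function names and the parameter `n` are hypothetical. The abstract does not state how TF-DPnorm combines the two values, so that step is deliberately left out.

```python
from collections import Counter


def clusters(tokens, n=5):
    """Return all contiguous n-word clusters (n-grams) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_and_dpnorm(texts, n=5):
    """Map each cluster to (TF, DPnorm) over a list of tokenized texts.

    Assumption: each text (e.g. one novel) is one corpus part.
    A lower DPnorm means the cluster is spread more evenly across parts.
    """
    per_text = [Counter(clusters(t, n)) for t in texts]
    sizes = [sum(c.values()) for c in per_text]
    if len(texts) < 2 or min(sizes) == 0:
        raise ValueError("need at least two texts, each with at least n tokens")

    total = sum(sizes)
    # Expected share of a cluster's occurrences in each part, by part size.
    expected = [s / total for s in sizes]

    # Raw term frequency of each cluster over the whole corpus.
    tf = Counter()
    for c in per_text:
        tf.update(c)

    scores = {}
    for cluster, freq in tf.items():
        # Observed share of this cluster's occurrences in each part.
        observed = [c[cluster] / freq for c in per_text]
        dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
        scores[cluster] = (freq, dp / (1 - min(expected)))
    return scores
```

Sorting the result by frequency (descending) or by DPnorm (ascending) yields two candidate rankings that could then be compared against a benchmark list of key clusters.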