Keyword extraction is the foundation for solving various text mining tasks. However, the literature relies heavily on statistical, linguistic feature-based, or graph-based metrics to gauge corpus-representative keywords, a process that is sensitive to preprocessing and stopword selection. In this paper, we propose a general-use keyword extraction model designed to work with document groups of varying size, domain, and readability, with or without keyword labels. To extract a better selection of keywords, we employ a simple logistic regression model with least absolute shrinkage and selection operator (lasso) regularization (Tibshirani, 1996). The classification-based structure of our approach ensures that the model learns words that distinctively characterize the given document group against the comparison groups, enhancing the representativeness of the extracted keywords. Furthermore, our model repeatedly updates coefficients as it learns the document label classifiers rather than relying directly on term frequencies, reducing its sensitivity to words of very high and very low frequency. We test our model against numerous classic keyword extraction frameworks as baselines using online customer reviews, news articles, and patent documentation. The results indicate that our proposed method performs robustly in terms of representativeness and distinctiveness across document groups of varying size, number of class labels, readability, and domain. Additionally, we show that our model beats the baseline models even when applied to documents without class labels. Given its generalizability and simplicity, we believe that our proposed model may serve as an easy-to-use yet powerful general-use tool for keyword extraction, especially when working with various groups of documents from different domains.