Large volumes of text data are often categorized under differing classification systems. In such cases, few-shot and unsupervised text classification are the two main approaches for dynamically classifying text into a single category. Unsupervised text classification typically exhibits lower performance but requires significantly less data preparation effort and fewer computing resources than the few-shot approach. This study proposes two methods to enhance unsupervised text classification of domain-specific non-English text using improved domain corpus embeddings: 1) weighted embedding-based anchor word clustering (wean-Clustering), and 2) cosine-similarity-based classification using a defect corpus vectorized by fine-tuned pretrained language models (sim-Classification-ftPLM). The proposed methods were tested on 40,765 Korean building defect complaints and achieved F1 scores of 89.12% and 84.66%, respectively, outperforming state-of-the-art zero-shot (53.79%) and few-shot (72.63%) text classification methods while requiring minimal data preparation effort and computing resources.
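To make the second idea concrete, the sketch below illustrates cosine-similarity-based classification against category embeddings in Python. It is a minimal illustration only: the model path, the toy categories, the example texts, and the mean-pooling of category corpus embeddings are all placeholder assumptions, not the paper's actual fine-tuned Korean PLM or aggregation scheme.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder path: the paper fine-tunes pretrained language models on a
# domain-specific defect corpus; substitute the actual fine-tuned model here.
model = SentenceTransformer("path/to/defect-finetuned-plm")

# Hypothetical defect categories, each represented by a few corpus texts.
category_corpus = {
    "leak": ["water dripping from the ceiling", "pipe leaking in the bathroom"],
    "crack": ["crack in the living room wall", "hairline crack on the concrete slab"],
}

# One vector per category: the mean of its corpus embeddings
# (a simple aggregation choice; the paper's exact scheme may differ).
category_vecs = {
    label: model.encode(texts).mean(axis=0)
    for label, texts in category_corpus.items()
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(complaint: str) -> str:
    """Assign the category whose corpus embedding is most cosine-similar."""
    v = model.encode(complaint)
    return max(category_vecs, key=lambda label: cosine(v, category_vecs[label]))

print(classify("Water is seeping through the bedroom ceiling"))  # expected: "leak"
```

In this setup, classification is a nearest-neighbor lookup in embedding space, which is why the approach needs no labeled training examples, only a representative corpus per category.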