Abstract

Large volumes of text are often categorized under differing classification systems. In such cases, few-shot and unsupervised text classification are the two main approaches for dynamically assigning text to a single category. Unsupervised text classification typically exhibits lower performance but requires significantly less data preparation effort and computing resources than the few-shot approach. This study proposes two methods to enhance unsupervised text classification for domain-specific non-English text using improved domain corpus embedding: 1) weighted embedding-based anchor word clustering (wean-Clustering), and 2) cosine-similarity-based classification using a defect corpus vectorized by fine-tuned pretrained language models (sim-Classification-ftPLM). The proposed methods were tested on 40,765 Korean building defect complaints and achieved F1 scores of 89.12% and 84.66%, respectively, outperforming state-of-the-art zero-shot (53.79%) and few-shot (72.63%) text classification methods, with minimal data preparation effort and computing resources.
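The core idea behind the second method, sim-Classification-ftPLM, can be sketched as nearest-category lookup by cosine similarity: each defect category is represented by an embedding derived from its corpus, and a complaint is assigned to the category whose embedding it is most similar to. The sketch below is a minimal illustration with toy vectors; the category names and vector values are hypothetical stand-ins for embeddings that the paper obtains from fine-tuned pretrained language models.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(doc_vec: np.ndarray, class_vecs: dict) -> str:
    """Assign the document to the category with the most similar embedding."""
    return max(class_vecs, key=lambda label: cosine_sim(doc_vec, class_vecs[label]))

# Toy vectors standing in for PLM-derived corpus embeddings (hypothetical values).
class_vecs = {
    "leak":  np.array([0.9, 0.1, 0.0]),
    "crack": np.array([0.1, 0.9, 0.2]),
}
doc_vec = np.array([0.8, 0.2, 0.1])  # embedding of an incoming complaint
print(classify(doc_vec, class_vecs))  # -> leak
```

In practice the category embeddings would come from the fine-tuned language model rather than hand-written vectors; only the argmax-over-cosine-similarity step is shown here.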
