Lexicon-based approaches to document classification are widely used, but constructing lexicons manually is time-consuming and resource-intensive. In this paper, we propose methods for automating the generation of lexicons that are subsequently used for document classification. We explore diverse methods for generating lexicons, including semantic matching, frequency-based approaches, machine learning algorithms, and large language model techniques. We then use these lexicons to classify documents based on their content. By comparing the results of the different lexicons on the same task, using criteria such as scalability and F1 score, we identify the use cases to which each method is best suited. We show that our automated approaches are effective and efficient, producing accurate classifications with minimal human intervention. Some of these approaches have the potential to streamline the document classification process by reducing the time and resources required for manual lexicon generation. Finally, we discuss the results obtained.