A Brief Survey of Text Document Classification Algorithms and Processes

Nihar M Ranjan, Rajesh S Prasad

doi:10.46610/jodmm.2023.v08i01.002

Abstract

The exponential growth of unstructured data is one of the most critical challenges in data mining, text analytics, and data analytics. Around 80% of the world's data is unstructured, and most of it goes unanalyzed because of the complexity of its analysis. Guaranteeing the quality of a text document classifier that categorizes documents according to user preferences is difficult because of the large number of terms and data patterns involved. The World Wide Web is growing rapidly, and the availability of electronic documents is increasing with it. Automatic categorization of documents is therefore a key factor in the systematic organization of information and in knowledge discovery. Most widely used text mining and classification strategies adopt term-based approaches. However, polysemy (one term with several meanings) and synonymy (several terms with one meaning) are serious problems for such approaches. Classifying documents by their context requires a context-based approach. Semantic analysis of the text overcomes the limitations of the term-based approach and also improves classifier accuracy. This paper highlights the important algorithms, techniques, and methodologies that can be used for text document classification, and reviews the different stages of the text document classification process.

Full Text