Abstract

In recent years, the number of complex documents and texts has grown exponentially, and classifying them accurately in many applications requires a deeper understanding of machine learning methods. Many machine learning approaches have achieved outstanding results in natural language processing. The success of these learning algorithms relies on their capacity to learn complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification remains a challenge for researchers. This paper gives a brief overview of text classification algorithms. The overview covers text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their applications in real-world problems are discussed.

Highlights

  • Text classification problems have been widely studied and addressed in many real applications [1,2,3,4,5,6,7,8] over the last few decades

  • (I) Feature Extraction: In general, texts and documents are unstructured data sets. These unstructured text sequences must be converted into a structured feature space when using mathematical modeling as part of a classifier

  • 1 − accuracy), on the other hand, are not widely used for text classification applications because they are insensitive to variations in the number of correct decisions due to the large value of the denominator (TP + true negatives (TN)) [215]


Summary

Introduction

Text classification problems have been widely studied and addressed in many real applications [1,2,3,4,5,6,7,8] over the last few decades. With recent breakthroughs in Natural Language Processing (NLP) and text mining, many researchers are interested in developing applications that leverage text classification methods. Most text classification and document categorization systems can be deconstructed into four phases: feature extraction, dimensionality reduction, classifier selection, and evaluation. We discuss the structure and technical implementation of text classification systems in terms of the pipeline illustrated in Figure 1 (the source code and the results are shared as free tools at https://github.com/kk7nc/Text_Classification). The initial pipeline input is a raw text data set. In general, a text data set contains sequences of text in documents, D = {X_1, X_2, ..., X_N}, where X_i refers to a data point (i.e., a document or text segment) with s sentences, such that each sentence includes w_s words with l_w letters. Each point is labeled with a class value from a set of k different discrete value indices [7].
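The four phases can be sketched end to end in a few lines of plain Python. This is only an illustrative sketch, not the code from the repository linked above: the toy documents, the function names, the document-frequency cutoff used as a stand-in for dimensionality reduction, the nearest-centroid classifier, and the use of accuracy as the metric are all assumptions chosen to keep the example small.

```python
from collections import Counter
import math

# Hypothetical toy data set D = {X_1, ..., X_N}; each X_i has one of k = 2 labels.
train = [
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("markets fell as rates rose", "finance"),
    ("the bank reported strong profit", "finance"),
]
test = [
    ("the team scored a late goal", "sports"),
    ("interest rates hit the markets", "finance"),
]

def tokenize(text):
    return text.lower().split()

def build_vocabulary(texts, min_df=2):
    """Phase 2 (assumed stand-in): keep only terms found in >= min_df documents."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return sorted(term for term, n in df.items() if n >= min_df)

def extract_features(text, vocabulary):
    """Phase 1: map unstructured text to a bag-of-words count vector."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

def fit_centroids(vectors, labels):
    """Phase 3: nearest-centroid classifier; one mean vector per class."""
    sums, sizes = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(vec, centroids):
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def accuracy(predicted, actual):
    """Phase 4: fraction of documents labeled correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

train_texts, train_labels = zip(*train)
test_texts, test_labels = zip(*test)

vocabulary = build_vocabulary(train_texts)
centroids = fit_centroids(
    [extract_features(t, vocabulary) for t in train_texts], train_labels
)
predictions = [predict(extract_features(t, vocabulary), centroids) for t in test_texts]
score = accuracy(predictions, test_labels)
print(vocabulary, predictions, score)
```

Each phase is a separate function, so any one of them can be swapped out (for example, a different feature weighting or a different classifier) without touching the others.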
