Abstract

This work uses the Bayes formula to vectorize a document as a probability distribution, computed from keywords, over the categories to which the document may belong. For a predetermined set of topics (categories), the Bayes formula gives the probability that the document belongs to each one. With this probability distribution as the vector representing the document, text classification algorithms based on the vector space model, such as the Support Vector Machine (SVM) and the Self-Organizing Map (SOM), can then classify the documents on a multi-dimensional level. This improves on the results obtained when only the highest probability is used to classify the document, as happens when the naïve Bayes classifier is used by itself. The effects of an inadvertent dimensionality reduction can be overcome using these algorithms. We compare the performance of these classifiers for high-dimensional data.
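The vectorization described above can be written out as follows. This is a sketch in the standard naïve Bayes form, not necessarily the exact estimator used in the paper; the symbols d (document), w_i (its keywords), c_k (the k-th of K categories) and v(d) (the resulting vector) are notation assumed here for illustration.

```latex
% Posterior probability of category c_k given document d with keywords w_1, ..., w_n,
% under the naive (conditional independence) assumption:
P(c_k \mid d) = \frac{P(c_k) \prod_{i=1}^{n} P(w_i \mid c_k)}
                     {\sum_{j=1}^{K} P(c_j) \prod_{i=1}^{n} P(w_i \mid c_j)}

% The document vector keeps the whole distribution:
\mathbf{v}(d) = \bigl( P(c_1 \mid d), \ldots, P(c_K \mid d) \bigr)

% The naive Bayes classifier used by itself keeps only the most probable category:
\hat{c}(d) = \arg\max_{k} P(c_k \mid d)
```

The components of v(d) sum to 1, so the vector has one dimension per category regardless of the size of the vocabulary.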

Highlights

  • The self-organizing map (SOM) is a clustering method that groups data according to a similarity measure based on Euclidean distance

  • The naïve Bayes method is used as a pre-processor at the front end of classification algorithms based on the vector space model, in our case the Support Vector Machine (SVM) and the SOM, to vectorize text documents before the training and classification stages are carried out

  • The naïve Bayes classifier can handle raw text data through the probability distribution calculated from keyword occurrence, whereas text classifiers based on the vector space model, such as the SVM and the SOM, typically require preprocessing to convert raw text documents into numerical values. It is therefore natural to use the naïve Bayes method as the vectorizer for classifiers based on the vector space model
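To make the Euclidean-distance similarity in the first highlight concrete, here is a minimal sketch of a self-organizing map in plain Python. It assumes a one-dimensional line of nodes, a Gaussian neighbourhood, and linearly decaying learning rate and radius. None of these choices, nor the names `train_som` and `best_matching_unit`, come from the paper.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: euclidean(weights[i], x))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D SOM (illustrative assumptions: nodes on a line,
    Gaussian neighbourhood, linearly decaying learning rate and radius)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Nodes near the winner on the line are pulled more strongly.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Toy probability vectors over two categories (made-up values).
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
weights = train_som(data)
```

After training, a new document vector would be assigned to the node returned by `best_matching_unit(weights, vector)`. How nodes are then mapped to category labels is left open here.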


Summary

The Hybrid Classification Approach

We design, implement and evaluate a hybrid classification method that integrates the naïve Bayes vectorizer with text classifiers based on the vector space model, taking advantage of the simplicity of the Bayes technique and the accuracy of the SVM and SOM classification approaches. All training documents are vectorized by their probability distribution in feature space, as multi-dimensional numerical arrays whose number of dimensions depends on the number of categories. With this transformation, the training documents are suitable for constructing the vectorized training dataset for the classifiers based on the vector space model. The output of the naïve Bayes vectorizer, namely the text documents represented as multi-dimensional numerical probability values, is used as the input to the SVM or the SOM for the final classification step.
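The vectorization step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes documents are already tokenized into keyword lists, uses Laplace (add-one) smoothing, and ignores words not seen in training. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods
    from tokenized documents (smoothing is an assumption of this sketch)."""
    categories = sorted(set(labels))
    vocab = {w for doc in docs for w in doc}
    word_counts = {c: Counter() for c in categories}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in categories}
    log_like = {}
    for c in categories:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return categories, log_prior, log_like


def vectorize(doc, model):
    """Return the posterior distribution over categories as a list,
    one dimension per category, in the order of model[0]."""
    categories, log_prior, log_like = model
    scores = [
        log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in categories
    ]
    # Normalize in log space for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Toy training set (made-up data for illustration only).
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
labels = ["sport", "sport", "finance", "finance"]
model = train_naive_bayes(docs, labels)

# Order of dimensions follows model[0], i.e. ["finance", "sport"].
vec = vectorize(["goal", "price", "team"], model)
```

Each document becomes a vector with one probability per category. Such vectors, rather than the single most probable label, would then form the training and test sets for the SVM or the SOM.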

The Naïve Bayes Classification Approach for Vectorization
Support Vector Machines for Text Classification
The Self-Organizing Map for Text Classification
The Evaluations and Experimental Results
Findings
Conclusion and Future Works