Text Document Pre-Processing Using the Bayes Formula for Classification Based on the Vector Space Model

Dino Isa, V P Kallimani, R Rajkumar, Lee Lam Hong

doi:10.5539/cis.v1n4p79

Abstract

This work uses the Bayes formula to vectorize a document according to a probability distribution, based on keywords, over the categories that the document may belong to. For a predetermined set of topics (categories), the Bayes formula gives the probability that the document belongs to each one. Using this probability distribution as the vector that represents the document, text classification algorithms based on the vector space model, such as the Support Vector Machine (SVM) and the Self-Organizing Map (SOM), can then classify the documents on a multi-dimensional level. This improves on the results obtained when only the highest probability is used to classify the document, as happens when the naïve Bayes classifier is used by itself. The effects of an inadvertent dimensionality reduction can be overcome using these algorithms. We compare the performance of these classifiers for high dimensional data.

Highlights

The self-organizing map (SOM) is a clustering method that groups data according to a similarity measure based on Euclidean distances.
Naïve Bayes is used as a pre-processor at the front end of classification algorithms based on the vector space model, in our case the Support Vector Machine (SVM) and the SOM, to vectorize text documents before the training and classification stages are carried out.
The naïve Bayes classifier can handle raw text data via the probability distribution calculated from keyword occurrence, whereas text classifiers based on the vector space model, such as the SVM and the SOM, typically require preprocessing to convert raw text documents into numerical values. It is therefore natural to use naïve Bayes as the vectorizer for the classifiers based on the vector space model.
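
As a rough illustration of the Euclidean-distance similarity that the SOM relies on, the sketch below finds the map node closest to an input vector and moves it toward that input. The function names, the two-node map, and the learning rate are assumptions made for this example and are not taken from the paper; a full SOM would also update neighbouring nodes and decay the learning rate over time.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: euclidean(weights[i], x))

def som_update(weights, x, learning_rate=0.5):
    """Move only the winning node toward x (neighbourhood update omitted)."""
    bmu = best_matching_unit(weights, x)
    weights[bmu] = [w + learning_rate * (xi - w)
                    for w, xi in zip(weights[bmu], x)]
    return bmu

# Hypothetical two-node map over two-category probability vectors.
weights = [[0.9, 0.1], [0.1, 0.9]]
x = [0.8, 0.2]
bmu = som_update(weights, x)
print(bmu, weights[bmu])
```

Here the input is closest to the first node, so only that node's weights shift toward the input.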

Summary

The Hybrid Classification Approach

The aim is to design, implement and evaluate a hybrid classification method that integrates the naïve Bayes vectorizer with text classifiers based on the vector space model, taking advantage of the simplicity of the Bayes technique and the accuracy of the SVM and SOM classification approaches. All the training documents are vectorized by their probability distribution in feature space, as numerical multi-dimensional arrays whose number of dimensions depends on the number of categories. With this transformation, the training documents are suitable for constructing the vectorized training dataset for the classifiers based on the vector space model. The output of the naïve Bayes vectorizer, namely the text documents represented as multi-dimensional numerical probability values, is used as the input to the SVM or the SOM for the final classification steps.
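
The vectorization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial naïve Bayes model with Laplace smoothing, one output dimension per category, and pre-tokenized documents. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-category word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter(labels)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "categories": sorted(doc_counts)}

def bayes_vectorize(model, tokens):
    """Return P(category | document) for every category, in sorted order."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = []
    for cat in model["categories"]:
        counts = model["word_counts"][cat]
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][cat] / total_docs)  # log prior
        for tok in tokens:
            if tok in model["vocab"]:  # words never seen in training are skipped
                # Laplace-smoothed likelihood (an assumption of this sketch)
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        log_scores.append(score)
    # Normalize in log space so the vector sums to 1.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical toy corpus with two categories.
docs = [["engine", "wheel", "car"], ["car", "road", "driver"],
        ["goal", "team", "ball"], ["ball", "player", "match"]]
labels = ["autos", "sports", "autos", "sports"]
labels = ["autos", "autos", "sports", "sports"]  # labels aligned with docs
model = train_naive_bayes(docs, labels)
vec = bayes_vectorize(model, ["car", "wheel", "ball"])
print(model["categories"], vec)
```

The resulting vector, rather than only its largest entry, is what would be passed to the SVM or the SOM as the document's representation.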

The Naïve Bayes Classification Approach for Vectorization

Support Vector Machines for Text Classification

The Self-Organizing Map for Text Classification

The Evaluations and Experimental Results

Findings

Conclusion and Future Works

Journal: Computer and Information Science
Publication Date: Oct 18, 2008
Citations: 11
License type: cc-by

