Abstract

Text categorization is one of the key functions for utilizing vast amount of documents. It can be seen as a classification problem, which has been studied in pattern recognition and machine learning fields for a long time and several classification methods have been developed such as statistical classification, decision tree, support vector machines and so on. Many researchers applied those classification methods to text categorization and reported their performance (e.g., decision tree[3], Bayes classifier[2], support vector machine[l]). Yang conducted comprehensive study of comparison or text categorization and reported that k nearest neighbor and support vector machines works well for text categorization[4].In the previous studies, classification methods were usually compared using single pair of training and test data However, classification method with more complex family of classifiers requires more training data and small training data may result in deriving unreliable classifier, that is, the performance of the derived classifier varies much depending on training data. Therefore, we need to take the size of training data into account when comparing and selecting a classification method. In this paper, we discuss how to select a classifier from those derived by various classification methods and how the size of training data affects the performance of the derived classifier.In order to evaluate the reliability of classification method, we consider the variance of accuracy of derived classifier. We first construct a statistical model. In the text categorization, each document is usually represented with a feature vector that consists of weighted frequencies of terms. In the vector space model, document is a point in high dimensional feature space and a classifier separates the feature space into subspaces each of which is labeled with a category.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call