Abstract

Term weighting is a vital step in automatic text categorization. Previous studies have shown that the choice of term weighting technique contributes more to classification accuracy than the choice of classifier. This work therefore concentrates on term weighting schemes and proposes a new supervised term weighting scheme for text categorization. The frequency of each term in a document is expressed as the probability of that term in the document, which gives the proportion of the document that the term accounts for. This proportion is informative about the category of the document. Summing the probability of a term over all the documents of a class yields a variable that can be used for term weighting in classification. It is essentially a document-level variable because it is derived from the probability of a term in a document. The resulting measure is named td (terms in a document). When evaluated on the Reuters-21578 and 20 Newsgroups datasets, td showed an increase in performance compared to tf, idf and rf. Compared to rf, this measure works well for both SVM (a binary classifier) and centroid-based classifiers (multiclass classifiers).
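The two quantities described in the abstract, the per-document proportion of a term and its sum over the documents of a class, can be illustrated with a minimal Python sketch. The abstract does not give the final td formula, so this computes only the summed proportions and should not be read as the authors' weighting scheme. The function names and the toy corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def term_proportions(tokens):
    """Return each term's share of the tokens in one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def class_proportion_sums(docs):
    """Sum per-document term proportions over the documents of each class.

    `docs` is an iterable of (tokens, label) pairs. This is only the
    intermediate variable the abstract describes, not the full td weight,
    whose exact definition is not stated in the abstract.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in docs:
        if not tokens:
            continue  # an empty document contributes nothing
        for term, proportion in term_proportions(tokens).items():
            sums[label][term] += proportion
    return {label: dict(terms) for label, terms in sums.items()}


# Hypothetical toy corpus: (tokenized document, class label)
docs = [
    ("wheat corn wheat".split(), "grain"),
    ("corn export".split(), "grain"),
    ("oil price oil oil".split(), "crude"),
]
sums = class_proportion_sums(docs)
```

In this toy corpus, "wheat" makes up two of the three tokens in the first grain document and is absent from the second, so its summed proportion for the grain class is 2/3. "corn" contributes 1/3 from the first document and 1/2 from the second, giving 5/6.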

