Abstract

Term weighting is a vital step in automatic text categorization. Previous studies have shown that the choice of term weighting technique contributes more to classification accuracy than the choice of classifier. This work therefore concentrates on term weighting and proposes a new supervised term weighting scheme for text categorization. The frequency of each term in a document is expressed as a probability, that is, the proportion of the document made up of that term. This proportion carries useful information about the category of the document. Summing the probability of a term over all the documents of a class yields a variable that can be used for term weighting in classification. It is a document-level variable because it is derived from the probability of a term within a document. The new measure is named td (terms in a document). Evaluated on the Reuters-21578 and 20 Newsgroups datasets, td improved performance compared to tf, idf and rf. Compared to rf, this measure works well with both SVM (a binary classifier) and centroid-based classifiers (multiclass classifiers).
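The quantity the abstract builds on, the per-document probability of a term summed over the documents of a class, can be illustrated with a minimal Python sketch. The abstract does not give the final td formula, so this computes only that intermediate sum. The function names `term_probabilities` and `class_probability_sums` and the toy corpus are hypothetical and used purely for illustration.

```python
from collections import Counter, defaultdict


def term_probabilities(tokens):
    """Return each term's share of one document: count(term) / document length."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: n / total for term, n in Counter(tokens).items()}


def class_probability_sums(docs, labels):
    """For every class, sum the per-document probability of each term.

    This is only the class-level sum described in the abstract, not the
    final td weight, whose exact formula is not stated there.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        for term, p in term_probabilities(tokens).items():
            sums[label][term] += p
    return {label: dict(terms) for label, terms in sums.items()}


# Toy corpus (invented for illustration): two "grain" documents, one "crude".
docs = [
    ["wheat", "price", "wheat", "export"],
    ["wheat", "harvest"],
    ["oil", "price"],
]
labels = ["grain", "grain", "crude"]

sums = class_probability_sums(docs, labels)
print(sums["grain"]["wheat"])  # 2/4 + 1/2
print(sums["crude"]["price"])  # 1/2
```

In this toy corpus "wheat" accumulates a much larger sum in the "grain" class than "price" does, which is the kind of class-level signal the abstract describes. Any normalization or comparison across classes would be an additional design choice beyond what the abstract states.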
