Categorising texts more accurately with field association terms

Tshering Cigay Dorji

doi:10.1504/ijcat.2015.071976

Abstract

Popular text classification algorithms such as Naive Bayes, kNN, Centroid-based classifiers and support vector machines (SVM) are based on supervised machine learning. They normally use a classical text representation technique that treats a document as a 'bag of words' and uses those words as features. This representation includes unimportant features and loses important semantic relationships and inflection information, which reduces accuracy. To address this problem, we propose a new text classification methodology based on field association terms, a set of terms that identify specific document fields. The methodology is compared against Naive Bayes, kNN, Centroid-based classifier and SVM on a closed dataset of 3180 documents from Wikipedia dumps and an open dataset of 9449 documents from the Reuters RCV1 Corpus, 20-Newsgroup and 4-Universities datasets. The new method outperformed the other algorithms with a precision of 97%, compared with 85% for Centroid-based, 78% for Naive Bayes, 48% for kNN and 42% for SVM.
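
The contrast between the two representations can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper: the FA_TERMS dictionary, its entries, and the functions bag_of_words and classify_by_fa_terms are hypothetical names invented here, and the scoring rule (count matching terms per field, pick the largest) is an assumption made only to show the idea of terms that point to a field.

```python
from collections import Counter

# Hypothetical field association (FA) terms: each term maps to the field it signals.
# These entries are invented for illustration and are not taken from the paper.
FA_TERMS = {
    "home run": "sports",
    "penalty kick": "sports",
    "interest rate": "finance",
    "stock exchange": "finance",
}


def bag_of_words(text):
    """Classical representation: unordered counts of single tokens."""
    return Counter(text.lower().split())


def classify_by_fa_terms(text):
    """Return the field whose FA terms occur most often, or None if none match.

    Assumed scoring rule: one point per occurrence of an FA term in the text.
    """
    lowered = text.lower()
    scores = Counter()
    for term, field in FA_TERMS.items():
        scores[field] += lowered.count(term)
    if not scores or max(scores.values()) == 0:
        return None
    return scores.most_common(1)[0][0]


if __name__ == "__main__":
    # Word order is lost in a bag of words, so these two texts look identical.
    print(bag_of_words("home run") == bag_of_words("run home"))
    print(classify_by_fa_terms("The central bank raised the interest rate today"))
```

In this sketch the multi-word term "interest rate" is matched as a unit, whereas the bag of words splits it into two unrelated tokens. This is one way to picture the loss of semantic relationships that the abstract describes.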

Full Text