Abstract

There has been a phenomenal increase in the use of text classification (TC) in applications such as targeted advertising and sentiment analysis. Most applications demand that the model be efficient and robust, yet produce accurate categorizations. This is quite challenging because there is a dearth of labelled training data: assigning a label requires reading the whole document, and the people labelling the documents may not agree on a particular categorization. Therefore, pre-labelled data from a different source, domain, or user may be used to augment (or replace) the training data. This results in a difference in distribution between the training and test sets, and the distribution of the test set may itself change over time. Such a shift in distribution between the training and test sets violates the inductive inference hypothesis underlying any machine learning (ML) prediction model, with some ML models more sensitive to this phenomenon than others. The performance of the support vector machine (SVM), one of the most successful classifiers for TC, degrades drastically in such scenarios. In this paper we therefore propose a novel and efficient method that uses a terms-based discriminative information space to train the SVM for scenarios where a distribution shift exists between the training and test sets. Our results on eight train-test pairs from four different domains suggest that an SVM trained in the discriminative information space significantly outperforms an SVM trained on the input feature space. Moreover, the methodology is simple, effective, and fast, with a small and tunable memory footprint.
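
The abstract does not spell out how the discriminative information space is constructed, so the sketch below only illustrates the general idea: score each term by how strongly it discriminates between classes (here a smoothed log-ratio of class-conditional term frequencies, an assumed stand-in for the paper's scoring), project every document onto one aggregated-evidence feature per class, and train a linear SVM in that small space. The helper names (`term_discrimination_scores`, `to_information_space`) and the toy documents are hypothetical, not from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def term_discrimination_scores(X, y, smoothing=1.0):
    """Per-term, per-class discriminative scores (assumed scheme).

    Terms that occur far more often inside a class than outside it get
    a high score for that class; smoothing avoids log(0) for rare terms.
    """
    classes = np.unique(y)
    scores = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        in_class = np.asarray(X[y == c].sum(axis=0)).ravel() + smoothing
        out_class = np.asarray(X[y != c].sum(axis=0)).ravel() + smoothing
        scores[i] = np.log((in_class / in_class.sum()) /
                           (out_class / out_class.sum()))
    return classes, scores

def to_information_space(X, scores):
    """Project term-count vectors into a low-dimensional space whose
    axes are the per-class aggregated discriminative evidence."""
    totals = np.asarray(X.sum(axis=1)).ravel()
    totals[totals == 0] = 1.0                 # guard empty documents
    Z = np.asarray(X @ scores.T)              # sum term scores per class
    return Z / totals[:, None]                # length-normalise

# --- usage sketch with placeholder documents --------------------------
train_docs = ["cheap flights book now", "election results announced",
              "discount offer sale today", "parliament passed the bill"]
train_labels = np.array([0, 1, 0, 1])
test_docs = ["sale on flights", "new bill in parliament"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

_, scores = term_discrimination_scores(X_train, train_labels)
clf = LinearSVC().fit(to_information_space(X_train, scores), train_labels)
print(clf.predict(to_information_space(X_test, scores)))
```

Because the projected space has only as many dimensions as there are classes, training in it is fast, and the memory footprint is governed by the vocabulary and smoothing choices, which is consistent with the small, tunable footprint the abstract claims.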
