Abstract

Naive Bayes (NB) classifiers are well-suited to several applications owing to their easy interpretability and maintainability. However, text classification is often hampered by the lack of adequate training data. This motivates the question: how can we train NB more effectively when training data is very scarce? In this paper, we introduce an established subsampling technique from statistics -- the jackknife -- into machine learning. Our approach jackknifes documents themselves to create new pseudo-documents. The underlying idea is that although these pseudo-documents do not have semantic meaning, they are equally representative of the underlying distribution of terms. Therefore, they can be used to train any classifier that learns this underlying distribution, namely, any parametric classifier such as NB (but not, for example, non-parametric classifiers such as SVM and k-NN). Furthermore, the marginal value of this additional training data should be highest precisely when the original data is inadequate. We then show that our jackknife technique is related to the question of additively smoothing NB via an appropriately defined notion of adjointness. This relation is surprising since it connects a statistical technique for handling scarce data to a question about the NB model. Accordingly, we are able to shed light on optimal values of the smoothing parameter for NB in the very scarce data regime. We validate our approach on a wide array of standard benchmarks -- both binary and multi-class -- for two event models of multinomial NB. We show that the jackknife technique can dramatically improve the accuracy of both event models of NB in the regime of very scarce training data. In particular, our experiments show that the jackknife can make NB more accurate than SVM on binary problems in this regime. We also provide a comprehensive characterization of the accuracy of these important classifiers (both binary and multi-class) in the very scarce data regime on benchmark text datasets, without feature selection or class imbalance.
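The abstract does not spell out the construction, but one natural reading of "jackknifing documents themselves" is leave-one-term-occurrence-out resampling: an n-term document yields n pseudo-documents, each omitting one occurrence. The sketch below illustrates that reading only; jackknife_docs, the toy corpus, and the use of scikit-learn's MultinomialNB (whose alpha argument is its additive-smoothing parameter) are our assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: one plausible reading of "jackknifing documents"
# as leave-one-term-occurrence-out resampling. jackknife_docs and the toy
# corpus are hypothetical; the paper's actual construction may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def jackknife_docs(tokens):
    """Return the n leave-one-out pseudo-documents of an n-term document.

    Each pseudo-document drops exactly one term occurrence, so none is
    semantically meaningful, but collectively they reflect the same
    underlying term distribution as the original document."""
    return [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]

# Tiny hypothetical training set: (token list, class label) pairs.
train = [(["cheap", "offer", "offer", "free"], 1),
         (["meeting", "agenda", "minutes"], 0)]

docs, labels = [], []
for tokens, y in train:
    # Keep the original document and add its pseudo-documents, all with
    # the original label, to enlarge the scarce training set.
    for pseudo in [tokens] + jackknife_docs(tokens):
        docs.append(" ".join(pseudo))
        labels.append(y)

vec = CountVectorizer()
X = vec.fit_transform(docs)

# alpha is scikit-learn's additive (Laplace/Lidstone) smoothing parameter,
# the quantity whose scarce-data behavior the paper relates to the jackknife.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)
print(clf.predict(vec.transform(["free offer"])))
```

Consistent with the abstract, such augmentation is only justified for parametric classifiers like NB that learn the term distribution; it would not be expected to help non-parametric classifiers such as SVM or k-NN.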


