Abstract

Although the Poisson distribution and two well-known Poisson mixtures (the negative binomial and K-mixture distributions) have been used as tools for modeling texts for over the last 15 years, the application of these distributions to building generative probabilistic text classifiers has rarely been reported, and the available information on applying such models to classification therefore remains fragmentary and even contradictory. In this study, we construct generative probabilistic text classifiers with these three distributions and perform classification experiments on three standard datasets in a uniform manner to examine the performance of the classifiers. The results show that the performance is much better than that of the standard multinomial naive Bayes classifier if document length is appropriately normalized. Furthermore, the results show that, contrary to our intuitive expectation, the classifier with the Poisson distribution performs best among all the examined classifiers, even though the Poisson model gives a cruder description of term occurrences in real texts than the K-mixture and negative binomial models do. A possible interpretation of the superiority of the Poisson model is given in terms of a trade-off between fit and model complexity.
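To make the setup concrete, the sketch below shows one plausible form of a generative Poisson text classifier with document-length normalization, as described in the abstract. It is a minimal illustration under our own assumptions (per-class Poisson rates estimated as mean term counts, a hypothetical smoothing constant of 0.01, and rescaling of counts to a fixed reference length); the paper's actual estimators and normalization scheme may differ.

```python
import math
from collections import Counter

def train_poisson_model(docs, labels, smooth=0.01):
    """Estimate per-class Poisson rates: the mean count of each term in
    documents of a class, plus a small smoothing constant (hypothetical)."""
    vocab = {w for d in docs for w in d}
    model = {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        totals = Counter()
        for d in class_docs:
            totals.update(d)
        model[c] = {w: totals[w] / len(class_docs) + smooth for w in vocab}
    return model

def classify(model, doc, norm_len=None):
    """Assign the class maximizing the Poisson log-likelihood of the document.
    If norm_len is given, term counts are rescaled so the document length
    matches norm_len -- one simple form of document-length normalization."""
    counts = Counter(doc)
    scale = norm_len / len(doc) if norm_len else 1.0
    best, best_score = None, float("-inf")
    for c, rates in model.items():
        # log P(doc | c) = sum_w [ x_w * log(lambda_w) - lambda_w - log(x_w!) ]
        score = 0.0
        for w, lam in rates.items():
            x = counts.get(w, 0) * scale
            score += x * math.log(lam) - lam - math.lgamma(x + 1)
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, training on a few toy documents labeled "sports" and "politics" and then classifying a short sports-like document should recover the "sports" label; the length-normalization argument matters mainly when training and test documents differ substantially in length.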
