Investigating the Statistical Assumptions of Naïve Bayes Classifiers

Anthony Kelly, Marc Anthony Johnson

doi:10.1109/ciss50987.2021.9400215

Abstract

This paper investigates how the probability distribution of a Naive Bayes classifier and the statistical distribution of the underlying feature data affect the classifier's performance. Typical Naive Bayes performance assumptions lack quantitative and rigorous evidence in the common literature, which creates risk in the rote application of Naive Bayes. This study investigates these performance assumptions to quantify where they hold and the risk of relying on them when using Naive Bayes classifiers. Naive Bayes classifiers' exceptionally fast training times, performance, ease of implementation, and minimal resource requirements often make them candidates for early classification trials, especially in Natural Language Processing tasks such as sentiment analysis. It is frequently assumed that the performance of a Naive Bayes classifier depends heavily on the distribution of the underlying data. This assumption is noted both in standard documentation and in academic research, and it has largely been accepted as truth with little verification. This paper outlines an experiment that tests this assumption with real-world sentiment analysis data. Naive Bayes classifiers were tested against non-Gaussian data, non-Gaussian feature-weighted data, Gaussian-like data, and synthetically generated Gaussian data to observe the relationship between classifier performance and data distribution. Initial findings suggested that while this assumption is partially true, there may be additional factors strongly related to Naive Bayes performance that are not strictly tied to a feature's distribution.
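As a rough illustration of the kind of comparison described above, the following minimal sketch fits a hand-written Gaussian Naive Bayes model to two synthetic datasets, one with Gaussian features and one with skewed (log-normal) features, and reports test accuracy for each. It is not the authors' code, data, or results: the samplers, class separations, sample sizes, and function names are all assumptions chosen only to make the idea concrete.

```python
import math
import random
from statistics import fmean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate a log prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "mean": [fmean(c) for c in cols],
            # Small floor keeps the variance strictly positive.
            "var": [pvariance(c) + 1e-9 for c in cols],
        }
    return model


def predict(model, x):
    """Return the class with the highest Gaussian log posterior."""
    def log_posterior(params):
        total = params["log_prior"]
        for v, m, s2 in zip(x, params["mean"], params["var"]):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


def make_data(sampler, n_per_class, n_features, rng):
    """Draw independent features for two classes from the given sampler."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([sampler(rng, label) for _ in range(n_features)])
            y.append(label)
    return X, y


# Hypothetical feature distributions (assumptions, not the paper's data).
def gaussian_sampler(rng, label):
    return rng.gauss(2.0 * label, 1.0)


def lognormal_sampler(rng, label):
    return rng.lognormvariate(0.5 * label, 1.0)  # skewed, heavy right tail


def run(sampler, seed=0):
    rng = random.Random(seed)  # fixed seed for repeatable runs
    X_train, y_train = make_data(sampler, 500, 5, rng)
    X_test, y_test = make_data(sampler, 500, 5, rng)
    model = fit_gaussian_nb(X_train, y_train)
    return accuracy(model, X_test, y_test)


acc_gaussian = run(gaussian_sampler)
acc_skewed = run(lognormal_sampler)
print(f"Gaussian features:   accuracy = {acc_gaussian:.3f}")
print(f"Log-normal features: accuracy = {acc_skewed:.3f}")
```

Because the two synthetic datasets also differ in how well the classes are separated, the two accuracies are not directly comparable; the sketch only shows the mechanics of fitting one model family to differently distributed features.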

Full Text