Abstract

This paper investigates how the probability distribution used by a Naive Bayes classifier and the statistical distribution of the underlying feature data affect the classifier's performance. Common assumptions about Naive Bayes performance lack rigorous, quantitative evidence in the literature, which creates risk when Naive Bayes is applied by rote. This study examines these assumptions to quantify where they hold and the risk of relying on them when using Naive Bayes classifiers. Naive Bayes classifiers' exceptionally fast training times, performance, ease of implementation, and minimal resource requirements often make them candidates for early classification trials, especially in Natural Language Processing tasks such as sentiment analysis. It is frequently assumed that the performance of a Naive Bayes classifier depends heavily on the distribution of the underlying data. This assumption appears in both standard documentation and academic research and has largely been accepted as true with little verification. This paper describes an experiment that tests this assumption with real-world sentiment analysis data. Naive Bayes classifiers were tested against non-Gaussian data, non-Gaussian feature-weighted data, Gaussian-like data, and synthetically generated Gaussian data to observe the relationship between classifier performance and data distribution. Initial findings suggest that while this assumption is partially true, additional factors not strictly related to a feature's distribution may also strongly influence Naive Bayes performance.
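The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' experimental setup: it assumes a hand-written Gaussian Naive Bayes, purely synthetic two-class data, and a lognormal distribution as a stand-in for "non-Gaussian" features. All function names, class means, and sample sizes are hypothetical choices made for illustration, and the sketch uses only the Python standard library.

```python
import math
import random
import statistics


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [statistics.fmean(c) for c in cols],
            # A small floor keeps the log-density finite for near-constant features.
            "var": [max(statistics.pvariance(c), 1e-9) for c in cols],
        }
    return model


def predict(model, x):
    """Return the class with the highest log-posterior under the Gaussian assumption."""
    def log_posterior(params):
        total = math.log(params["prior"])
        for v, m, s2 in zip(x, params["mean"], params["var"]):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


def make_data(sampler, n, rng):
    """Two balanced classes, each with two independently drawn features."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n):
            X.append([sampler(rng, label), sampler(rng, label)])
            y.append(label)
    return X, y


def gaussian_sampler(rng, label):
    # Assumed class means 0 and 2 with unit standard deviation.
    return rng.gauss(2.0 * label, 1.0)


def lognormal_sampler(rng, label):
    # Assumed skewed, non-Gaussian stand-in; parameters are illustrative only.
    return rng.lognormvariate(0.5 * label, 1.0)


def run(sampler, seed=0, n_train=500, n_test=500):
    """Train on one synthetic sample and report accuracy on a fresh one."""
    rng = random.Random(seed)
    X_train, y_train = make_data(sampler, n_train, rng)
    X_test, y_test = make_data(sampler, n_test, rng)
    return accuracy(fit_gaussian_nb(X_train, y_train), X_test, y_test)


if __name__ == "__main__":
    print("Gaussian features: ", run(gaussian_sampler))
    print("Lognormal features:", run(lognormal_sampler))
```

The two accuracies are not directly comparable as evidence for or against the distribution assumption, because the two synthetic problems differ in class separation as well as in distribution shape. That confound is one example of the additional factors the abstract alludes to.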
