Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset

Tula Kanta Deo,Rajesh Keshavrao Deshmukh,Gajendra Sharma

doi:10.3126/njmr.v7i2.68189

Abstract

Background: Text classification techniques are increasingly important with the exponential growth of textual data on the internet. Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer(CV) are commonly used methods for feature extraction. TF-IDF assigning weights to terms based on their frequency. CV simply counts the occurrences of terms. The performance of CV as well as TF-IDF are evaluated and compared with KNN and DT classifiers across text datasets. Methodology: The investigation begins with preprocessing. The feature vectors are created using both TF-IDF and CV. Feature vectors are passed into the KNN and DT classifiers at in training stage. Experiments are executed the usage of Kaggle's public database Ukraine 10K tweets sentiment_analysis dataset and the Womens ecommerce clothing reviews dataset. Findings: The average of precision, recall, f1 score and accuracy of KNN with TF-IDF were 84.5%, 87%, 83%, 87% respectively and KNN with CV were 83.5%, 87%, 83.5%, 87% respectively. Similarly, average of precision, recall, f1 score and accuracy of DT with TF-IDF were 89%, 89%, 89%, 89% respectively and DT with CV were 89%, 89.5%, 89.5%, 89.5% respectively. The results obtained in this research is consistent with previous similar research result. Conclusions: The performance of TF-IDF is almost similar as CV for a particular dataset and a particular classifier in this study. Novelty: The experiment performed using these classifiers and feature extraction methods on the datasets is a novelty and contribution of this research.

Full Text