Classification and Analysis of Textual data using Naive Bayes with TF-IDF

Chingmuankim Chingmuankim, Rajni Jindal

doi:10.1109/icecie55199.2022.10000309

Abstract

Text classification has emerged as an important topic because it allows us to extract meaningful information from data and improve the performance of businesses and organizations. Often termed text tagging or text categorization, it operates on textual data that can be structured, semi-structured, or unstructured. This work uses unstructured data collected through the Twitter API. Because manual sorting is time-consuming and tedious, the unstructured data are then structured using the NLP cloud API. The resulting structured data comprise a set of categorical labels assigned on the basis of the content of the comments. Text classification has various use cases, such as sentiment analysis, polarity checking, natural language inference, and assessing grammatical correctness. Previous researchers have carried out experimental work using Naive Bayes with a Bag of Words (BOW) feature extraction technique. The objective of this work is to analyze the transformed, structured, imbalanced data and to study its impact on the accuracy of a Naive Bayes model that uses the Term Frequency-Inverse Document Frequency (TF-IDF) technique. Naive Bayes is a linear, probabilistic, supervised machine learning classifier based on Bayes' theorem. On training and testing the data with the proposed model, the overall accuracy is found to improve by 2-3%.
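To make the combination of TF-IDF weighting and Naive Bayes concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes whitespace tokenization, a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, a multinomial Naive Bayes with Laplace smoothing (alpha = 1), and a tiny hypothetical training set; all function names and example texts are illustrative only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real tweets need more cleaning.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one document; terms unseen during training are dropped."""
    tf = Counter(term for term in tokenize(doc) if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes fitted on TF-IDF weights instead of raw counts."""
    idf = fit_idf(docs)
    vocab = list(idf)
    weight = defaultdict(lambda: defaultdict(float))  # class -> term -> summed TF-IDF
    for doc, label in zip(docs, labels):
        for term, w in tfidf(doc, idf).items():
            weight[label][term] += w
    model = {"idf": idf, "log_prior": {}, "log_like": {}}
    for label, n_label in Counter(labels).items():
        model["log_prior"][label] = math.log(n_label / len(docs))
        total = sum(weight[label].values()) + alpha * len(vocab)
        model["log_like"][label] = {
            term: math.log((weight[label][term] + alpha) / total) for term in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log posterior score."""
    feats = tfidf(doc, model["idf"])
    scores = {
        label: prior + sum(w * model["log_like"][label][t] for t, w in feats.items())
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data, standing in for labelled tweets.
train_docs = [
    "great phone love the battery",
    "terrible service very slow",
    "love this great camera",
    "slow and terrible support",
]
train_labels = ["positive", "negative", "positive", "negative"]

model = train_nb(train_docs, train_labels)
print(predict(model, "great battery"))
print(predict(model, "terrible slow"))
```

The score for each class is the log prior plus a sum of per-term log likelihoods weighted by TF-IDF, so it is linear in the feature vector, which is the sense in which Naive Bayes is described as a linear classifier. With imbalanced data the class prior term favors the majority class, which is one reason accuracy alone can be a misleading measure in that setting.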

Full Text