Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

Amit Purushottam Pimpalkar, R Jeberson Retna Raj

doi:10.14201/adcaij2020924968

Abstract

Data analytics and its associated applications have recently become important fields of study. Researchers are now concerned with the massive amount of data produced every minute and second as people constantly share thoughts and opinions about the things that affect them. Social media information, however, is still unstructured, scattered and hard to handle, and it needs a strong foundation before it can be used as valuable information on a particular topic. Processing such unstructured data is challenging because of noise, co-relevance, emoticons, folksonomies and slang, and it therefore requires proper data pre-processing before the right sentiments can be obtained. The dataset is extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are done with the Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five Machine Learning (ML) algorithms: Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one domain-specific and the other not. The SVM classifier outperformed the other classifiers, with 73.12% accuracy and 94.91% precision. This research also highlights that the selection and representation of features, along with various pre-processing techniques, have a positive impact on classification performance.
Overall, the results indicate an improvement in sentiment classification, and we noted that pre-processing approaches improve the efficiency of the classifiers.
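The feature schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the paper uses NLTK and Scikit-learn, whereas this version uses only the Python standard library so the arithmetic is visible. The function names (`preprocess`, `bag_of_words`, `tf_idf`), the tiny stop-word list, the sample tweets, and the particular smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "it", "this"}


def preprocess(text):
    """Lowercase, drop URLs and @mentions, strip '#', keep alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return (sorted vocabulary, raw count vectors) for tokenized documents."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def tf_idf(docs):
    """TF-IDF with tf = count / document length and
    smoothed idf = ln((1 + N) / (1 + df)) + 1 (one common variant, assumed here)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weights = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against empty documents
        weights.append([(c / length) * w for c, w in zip(row, idf)])
    return vocab, weights


# Hypothetical sample tweets, not taken from the paper's datasets.
tweets = [
    "I love this phone! https://example.com #happy",
    "@user this phone is terrible",
    "love love love it",
]
tokens = [preprocess(t) for t in tweets]
vocab, bow = bag_of_words(tokens)
_, tfidf = tf_idf(tokens)
```

In this sketch, BOW keeps raw counts, so the repeated "love" in the third tweet dominates its vector, while TF-IDF gives a higher weight to "happy" than to "love" in the first tweet because "happy" appears in fewer documents. Either matrix could then be passed to a classifier such as the ones the abstract lists.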

Journal: ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal	Publication Date: Jun 18, 2020
Citations: 41	License type: cc-by-nc-nd

Similar Papers

Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques
Abdullah Y Muaad ... D.S Guru
Mathematical Problems in Engineering | VOL. 2022
30 Apr 2022

Software Requirements Classification Using Machine Learning Algorithms.
Edna Dias Canedo ... Bruno Cordeiro Mendes
Entropy (Basel, Switzerland) | VOL. 22
21 Sep 2020

Effects of Light Stemming on Feature Extraction and Selection for Arabic Documents Classification
Yousif A Alhaj ... Mohamed Abd Elaziz
30 Nov 2019

HAPI: An efficient Hybrid Feature Engineering-based Approach for Propaganda Identification in social media.
Akib Mohi Ud Din Khanday ... Mudasir Ahmad Wani
PloS one | VOL. 19
10 Jul 2024
