Abstract

Objectives: To extract and identify the subjective information of social media users from unstructured data; to overcome high dimensionality and sparsity, the two major challenges in sentiment analysis of text datasets; and to improve model performance using the smallest feasible feature set in a text classification problem.

Methods: We propose a new filtration method that removes correlated features and zero-importance features, applied in addition to various feature selection methods. Feature selection methods (Mutual Information, Lasso, and Recursive Feature Elimination) and a dimensionality reduction method, Principal Component Analysis (PCA), were used along with the proposed filtration to find the most relevant features. The approach was evaluated on tweets about three Indian Government schemes, which were classified using a Random Forest classifier. Performance was evaluated using accuracy, precision, recall, F1 score, log loss, and ROC-AUC.

Findings: We propose a model for selecting relevant, non-correlated feature subsets from an unstructured dataset. The model achieved an accuracy of 92% with a minimum log loss of 0.22 using the smallest feature set.

Improvements: This study shows that model performance improves when the two problems (dimensionality and sparsity) are addressed. Various feature selection methods were applied together with the proposed filtration to minimize the number of features. Reducing the number of features improves both computing time and model performance, and the effect is expected to be greater for large datasets. Even though Random Forest performs well on high-dimensional datasets, further optimization is still needed.

Keywords: Mutual Information (MI); Lasso (L1); Recursive Feature Elimination (RFE); Random Forest (RF); Principal Component Analysis (PCA)
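The filtration step described in the Methods (dropping correlated and zero-importance features before a feature selector and a Random Forest classifier) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the synthetic data stands in for a vectorized tweet matrix, the 0.9 correlation threshold, the choice of 10 features, and the helper names `drop_correlated`, `drop_zero_importance`, and `run_sketch` are all hypothetical, and mutual information is used as just one of the selectors the abstract names.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split


def drop_correlated(X, threshold=0.9):
    """Return indices of columns kept after dropping the later column
    of every pair whose absolute Pearson correlation exceeds threshold."""
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.abs(np.corrcoef(X, rowvar=False))
    # Constant columns yield NaN correlations; treat them as uncorrelated here
    # and leave their removal to the zero-importance step.
    corr = np.nan_to_num(corr)
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return np.array(keep)


def drop_zero_importance(X, y, seed=0):
    """Return indices of columns with non-zero Random Forest importance."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    return np.flatnonzero(rf.feature_importances_ > 0)


def run_sketch(seed=0):
    # Hypothetical stand-in for a vectorized tweet matrix.
    X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                               n_redundant=5, random_state=seed)
    # Append three perfectly correlated columns and one all-zero column.
    X = np.hstack([X, 2.0 * X[:, :3], np.zeros((X.shape[0], 1))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)

    # Filtration, fitted on the training split only to avoid leakage.
    kept = drop_correlated(X_tr)
    kept = kept[drop_zero_importance(X_tr[:, kept], y_tr, seed)]

    # Feature selection by mutual information on the filtered columns.
    selector = SelectKBest(partial(mutual_info_classif, random_state=seed),
                           k=min(10, kept.size))
    X_tr_sel = selector.fit_transform(X_tr[:, kept], y_tr)
    X_te_sel = selector.transform(X_te[:, kept])

    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_tr_sel, y_tr)
    return {
        "n_start": X.shape[1],
        "kept_after_filtration": kept,
        "n_selected": X_tr_sel.shape[1],
        "accuracy": accuracy_score(y_te, clf.predict(X_te_sel)),
        "log_loss": log_loss(y_te, clf.predict_proba(X_te_sel)),
    }


if __name__ == "__main__":
    print(run_sketch())
```

In this sketch the filtration runs before the selector, so the selector only ranks columns that are neither redundant nor unused by the forest. Swapping `SelectKBest` for Lasso-based selection, RFE, or PCA would exercise the other methods listed in the abstract.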

Highlights

  • According to the Digital 2020 Global Overview Report, published in January 2020, nearly 60% of the world’s population is already online, and more than half of the world’s population is expected to be using social media by mid-2020

  • In addition to the metrics evaluated in existing work, log loss was analysed in the proposed work

  • The improvement achieved by the proposed model was analysed in terms of computational time for every feature selection method
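The two measurements mentioned in the highlights, log loss and computational time, can be illustrated with a small standard-library sketch. The function names `binary_log_loss` and `timed` are hypothetical helpers introduced for illustration only; the binary log loss formula itself is the standard one, and the clipping constant `eps` is an assumption.

```python
import math
import time


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels given predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    confident = [0.9, 0.1, 0.8, 0.2]
    hesitant = [0.6, 0.4, 0.6, 0.4]
    # Both prediction sets classify every label correctly at a 0.5 cut-off,
    # yet the confident one has the lower log loss.
    print(binary_log_loss(labels, confident), binary_log_loss(labels, hesitant))
    print(timed(binary_log_loss, labels, confident))
```

Because log loss penalizes poorly calibrated probabilities even when the predicted class is correct, it can separate models that accuracy alone cannot, which is one reason to report it alongside the other metrics.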

Summary

Introduction

According to the Digital 2020 Global Overview Report, published in January 2020, nearly 60% of the world’s population is already online, and more than half of the world’s population is expected to be using social media by mid-2020. Between July and September 2020, more than 180 million people started using social media, an average of almost 2 million new users every day. The latest data indicates that more than two-thirds (68%) of the world’s population are using social media. People use social media every day to share their opinions on issues such as events, persons, products, services, and politics. Sentiment analysis of social media plays a vital role in monitoring public opinion on particular topics. Sentiment analysis faces various challenges, of which high dimensionality and sparsity are the two
