Abstract

COVID-19 is one of the deadliest viral diseases, having killed millions of people around the world to date. These deaths are linked not only to infection but also to people's mental states and sentiments triggered by fear of the virus. People's sentiments, which are predominantly available as posts/tweets on social media, can be interpreted using two kinds of information: syntactical and semantic. Herein, we propose to analyze people's sentiments using both kinds of information on a COVID-19-related Twitter dataset in the Nepali language. First, we use two widely used text representation methods, TF-IDF and FastText, and then combine them into hybrid features that capture highly discriminating information. Second, we implement nine widely used machine learning classifiers (Logistic Regression, Support Vector Machine, Naive Bayes, K-Nearest Neighbor, Decision Trees, Random Forest, Extra Tree classifier, AdaBoost, and Multilayer Perceptron) on each of the three feature representations: TF-IDF, FastText, and Hybrid. To evaluate our methods, we use a publicly available Nepali COVID-19 tweets dataset, NepCOV19Tweets, which consists of Nepali tweets categorized into three classes (Positive, Negative, and Neutral). The evaluation results on NepCOV19Tweets show that the hybrid feature extraction method not only outperforms the two individual feature extraction methods across the nine machine learning algorithms but also performs well compared with the state-of-the-art methods.

Highlights

  • Natural language processing (NLP) techniques have been developed to assess people’s sentiments on various topics

  • We choose nine widely used machine learning classifiers: Logistic Regression (LR), Random Forest (RF), Naive Bayes (NB), K-Nearest Neighbour (KNN), Decision Tree (DT), Extra Tree Classifier (ETC), Adaptive Boosting (AdaBoost), Multilayer Perceptron Neural Network (MLP-NN), and Support Vector Machine (SVM). The selection of classifiers in this study is based on their ability to deliver promising classification accuracy in both Nepali and non-Nepali document analysis in the literature [1,7,25]. A short description of each classifier is presented in the following paragraphs

  • We propose hybrid features (FastText + Term Frequency-Inverse Document Frequency (TF-IDF)) to represent Nepali COVID-19-related tweets for sentiment classification
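
The hybrid representation named in the highlights can be illustrated with a minimal sketch. It assumes the two feature sets are combined by simple concatenation, which this page does not spell out. The helper names (`tfidf_vectors`, `mean_embedding`, `hybrid_features`), the smoothed IDF formula, the English placeholder tokens, and the tiny hand-made embedding table are all hypothetical; a real FastText model would supply the word vectors and would also handle unseen words through subword n-grams, which this sketch does not.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) of TF-IDF weights using a smoothed IDF (an assumption)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def mean_embedding(doc, table, dim):
    """Average the word vectors of a document; tokens missing from the table are skipped."""
    known = [table[tok] for tok in doc if tok in table]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def hybrid_features(docs, table, dim):
    """Concatenate each document's TF-IDF vector with its mean word embedding."""
    _, tfidf = tfidf_vectors(docs)
    return [tfidf[i] + mean_embedding(doc, table, dim) for i, doc in enumerate(docs)]


# Toy tokenized "tweets" and a made-up 2-dimensional embedding table
# (placeholders standing in for Nepali tokens and FastText vectors).
toy_docs = [["covid", "fear"], ["covid", "vaccine", "hope"]]
toy_table = {"covid": [0.1, 0.3], "fear": [-0.5, 0.2], "hope": [0.6, 0.4]}
features = hybrid_features(toy_docs, toy_table, dim=2)
```

Each row of `features` has one TF-IDF weight per vocabulary term followed by the averaged embedding, so any of the nine classifiers listed above could in principle be trained on it unchanged.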

Summary

Introduction

Natural language processing (NLP) techniques have been developed to assess people’s sentiments on various topics. Recent works [1–8] on COVID-19 tweet sentiment analysis in English and other languages [8] underscore the efficacy of data-driven machine learning approaches, employing several kinds of analysis such as topic modeling, classification, and clustering. This calls for a thorough comparison of machine learning methods in sentiment analysis, together with a better representation of tweets for sentiment classification. These works used popular feature extraction methods such as TF-IDF (Term Frequency-Inverse Document Frequency) and word embedding methods such as word2vec [9], GloVe [10], and FastText [11]. From these existing works, we identified three main limitations in Nepali COVID-19-related tweet representation and classification. Among them, there is no study offering a detailed comparison of machine learning (ML) methods for sentiment classification on a COVID-19-related tweets dataset in the Nepali language.

Related Works
Proposed Approach
Preprocessing
TF-IDF Feature Extraction
Word Embedding Feature Extraction
Feature Fusion
Classification
K-Nearest Neighbor
Support Vector Machine
Experiment and Analysis
Evaluation Metrics
Implementation
Comparative Study of ML Classifiers on Three Different Features
Class-Wise Study of Classifiers’ Performance on Hybrid Features
Comparison of Our Method with the State-of-the-Art Methods
Conclusion and Future Works