Predicting Coronavirus Pandemic in Real-Time Using Machine Learning and Big Data Streaming System

Xiongwei Zhang, Abdelmgeid A Ali, Radhya Sahal, Eman M G Younis, Hager Saleh, Ahmed Mostafa Khalil

doi:10.1155/2020/6688912

Open Access

Abstract

Twitter is a social network where people share posts and opinions about current events, such as the coronavirus pandemic. It is considered the most significant source of streaming data for machine learning research on analysis, prediction, knowledge extraction, and opinion mining. Sentiment analysis is a text analysis method that has gained further significance with the emergence of social networks. This paper therefore introduces a real-time system for sentiment prediction on Twitter streaming data for tweets about the coronavirus pandemic. The proposed system aims to find the machine learning model that achieves the best performance for coronavirus sentiment prediction and then to use that model in real time. The system has two components: an offline sentiment analysis component and an online prediction pipeline. For the offline component, a dataset of historical tweets was collected between 23/01/2020 and 01/06/2020 and filtered by the #COVID-19 and #Coronavirus hashtags. Two feature extraction methods for textual data, n-gram and TF-IDF, were used to extract the dataset’s essential features. Then, five standard machine learning algorithms (decision tree, logistic regression, k-nearest neighbors, random forest, and support vector machine) were trained and compared to select the best model for the online prediction component. The online prediction pipeline was developed using the Twitter Streaming API, Apache Kafka, and Apache Spark. The experimental results indicate that the random forest (RF) model using the unigram feature extraction method achieved the best performance, and it is used for sentiment prediction on Twitter streaming data about the coronavirus.

Highlights

Coronavirus disease, or COVID-19, is a novel viral disease that emerged in late 2019 [1]. On 11 March 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a pandemic [2]. The virus has spread throughout the world surprisingly fast, and nearly all countries are fighting it, trying their best to stop the spread of this invisible killer as far as possible [3]
We examine the performance of the five machine learning models on the collected tweets dataset about the coronavirus (Twitter hashtags #Coronavirus and #COVID-19) covering the period between 30/01/2020 and 01/06/2020. The dataset is preprocessed with cleaning, tokenization, stop-word removal, and stemming steps, as described in the previous sections. Then, the tweets are labeled as positive, negative, or neutral using the TextBlob API. The machine learning models were trained with 90% of the data and tested with the remaining 10%. The five classifiers were implemented using the Scikit-learn 0.21.3 package in Python 3.7
The logistic regression (LR) and k-nearest neighbors (KNN) models reported testing performance consistent with their cross-validation performance, while the decision tree (DT), random forest (RF), and support vector machine (SVM) models did not, although all testing performances are lower than the corresponding cross-validation performances
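The labeling and 90/10 split mentioned in the highlights can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the source does not give the rule used to turn a TextBlob polarity score into a class, so the sign-based rule below is an assumption, the helper names `label_from_polarity` and `split_90_10` are hypothetical, and the example tweets and polarity values are made up rather than taken from the dataset.

```python
import random


def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment class.

    Assumption: a sign-based rule (positive > 0, negative < 0, neutral == 0).
    The source does not state the thresholds it used.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def split_90_10(items, seed=0):
    """Shuffle a copy of `items` and split it into 90% train / 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (tweet, polarity) pairs; in the described setup the polarity
# would come from TextBlob rather than being typed in by hand.
scored = [
    ("great news on the vaccine", 0.8),
    ("lockdown is terrible", -1.0),
    ("cases reported today", 0.0),
]
labeled = [(text, label_from_polarity(p)) for text, p in scored]

# Split 100 placeholder sample indices into 90 for training and 10 for testing.
train, test = split_90_10(range(100))
```

A fixed seed keeps the split reproducible, which matters when several classifiers are compared on the same held-out 10%.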

Summary

Introduction

Coronavirus disease, or COVID-19, is a novel viral disease that emerged in late 2019 [1]. On 11 March 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a pandemic [2]. The virus has spread throughout the world surprisingly fast, and nearly all countries are fighting it, trying their best to stop the spread of this invisible killer as far as possible [3]. To the best of our knowledge, there is no study oriented to real-time sentiment analysis during the coronavirus pandemic. This motivates us to introduce a new real-time sentiment prediction system that works on Twitter streaming data about the coronavirus pandemic. The results of the proposed system (i.e., the predicted classes of people’s opinions about the coronavirus) could be emitted to any data sink, such as real-time reporting and dashboards, storage, and mobile apps. The contributions of this paper are as follows:
(1) Developing a real-time system to predict sentiment about the coronavirus pandemic using Twitter streaming data
(2) Collecting tweets about the coronavirus using the #COVID-19 and #Coronavirus hashtags and labeling them using TextBlob
(3) Applying different n-gram sizes with the TF-IDF feature extraction method
(4) Comparing five machine learning classifiers to find the optimal model for predicting coronavirus sentiment in real time
The remainder of this paper is organized as follows.
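To make contribution (3) in the introduction above more concrete, the following minimal sketch computes n-gram features with TF-IDF weights using only the Python standard library. It assumes the textbook weighting tf × log(N/df) and whitespace tokenization; the paper reports using Scikit-learn, whose TF-IDF variant applies smoothing and normalization, so the numbers here would differ from the authors' features. The function names `ngrams` and `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Compute textbook TF-IDF weights for the n-grams of each document.

    tf  = count of the n-gram in the document / total n-grams in the document
    idf = log(number of documents / number of documents containing the n-gram)
    """
    counts_per_doc = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]

    # Document frequency: in how many documents each n-gram appears.
    df = Counter()
    for counts in counts_per_doc:
        df.update(counts.keys())

    n_docs = len(docs)
    weights = []
    for counts in counts_per_doc:
        total = sum(counts.values())
        weights.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return weights


# Hypothetical sample documents, not drawn from the tweets dataset.
docs = ["stay home stay safe", "stay safe"]
unigram_weights = tfidf(docs, n=1)
bigram_weights = tfidf(docs, n=2)
```

In this toy corpus, terms that occur in every document (such as "stay") receive a weight of zero, while "home", which occurs in only one document, keeps a positive weight. Changing `n` switches between unigram, bigram, and larger feature sets.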

Related Work

Results and Discussion

Discussion