Abstract

COVID-19 has had an enormous negative impact on human lives and the world economy. To help in the fight against this pandemic, this study evaluates different database systems and selects the most suitable one for storing, handling, and mining COVID-19 data. We evaluate SQL and NoSQL database systems using the following metrics: query runtime, memory used, CPU used, and storage size. The database systems assessed were Microsoft SQL Server, MongoDB, and Cassandra. We also evaluate Data Mining algorithms, including Decision Trees, Random Forest, Naive Bayes, and Logistic Regression, using data classification tests in the Orange Data Mining software. Classification tests were performed using cross-validation on a table with about 3 million records, including COVID-19 exams with patients’ symptoms. The Random Forest algorithm obtained the best average accuracy, recall, precision, and F1 Score in the COVID-19 predictive model built in the mining stage. In the performance evaluation, MongoDB presented the best results in almost all tests with a large data volume.
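
As an illustration of the four classification scores named above, the following minimal Python sketch derives accuracy, precision, recall, and F1 Score from the counts of a binary confusion matrix. The `Confusion` class, the `classification_scores` function, and the example counts are hypothetical and are not taken from the study, which ran its tests in Orange Data Mining.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def classification_scores(c: Confusion) -> dict:
    """Return accuracy, precision, recall, and F1 for one confusion matrix."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, used only to exercise the function.
scores = classification_scores(Confusion(tp=8, fp=2, fn=2, tn=8))
```

In a cross-validated setting, scores like these would typically be computed per fold and then averaged.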

Highlights

  • Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), 3030-199 Coimbra, Portugal; Centre of Informatics and Systems of University of Coimbra (CISUC), 3030-290 Coimbra, Portugal; FATEC Mogi das Cruzes, São Paulo Technological College, Mogi das Cruzes 08773-600, Brazil

  • This work focuses on two main areas: SQL and NoSQL databases, and Data Mining

  • Orange Data Mining was connected to Microsoft SQL Server to perform the Data Mining tests, and the audit trail tracked all queries that Orange needed to make in the database to perform them

Summary

Introduction

The human species has already witnessed several pandemics during its existence. A pandemic is an epidemic occurring on a scale that crosses international boundaries, usually affecting many people [1]. Data Mining algorithms were used in classification problems to extract insights from the collected data and develop a COVID-19 predictive model with suitable accuracy. We evaluate one SQL database, Microsoft SQL Server [11], and two of the most popular NoSQL databases, MongoDB [12] and Cassandra [13]. This evaluation was performed on real COVID-19 datasets by comparing the databases in terms of query runtime, RAM consumed, CPU percentage used, and data storage size. The first goal of this work is to create a COVID-19 database and mine it to extract insights from the data.
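
The kind of comparison described above can be sketched with a small timing harness. This is a minimal Python sketch under stated assumptions, not the measurement procedure used in the study: `benchmark` and `fake_query` are hypothetical names, `fake_query` stands in for a real database call, and `tracemalloc` reports only memory allocated by the Python client process, not the RAM consumed by a database server.

```python
import time
import tracemalloc
from statistics import mean


def benchmark(query, repeats=5):
    """Run `query` several times and report mean runtime and peak memory.

    Peak memory is the largest value seen across runs and covers only
    Python-side allocations made while the query function executes.
    """
    runtimes = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        query()
        runtimes.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {
        "mean_runtime_s": mean(runtimes),
        "peak_memory_bytes": peak_bytes,
        "runs": repeats,
    }


def fake_query():
    # Hypothetical stand-in for a call such as a SELECT or a find();
    # it only builds a list so the harness has something to measure.
    return [i * i for i in range(100_000)]


result = benchmark(fake_query)
```

To compare systems, the same harness would wrap one equivalent query per database, with server-side RAM, CPU percentage, and storage size collected separately through each system's own monitoring tools.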

Related Work
SQL versus NoSQL Databases
Data Mining on COVID-19 Data
Data Mining
Algorithms
Naïve Bayes
Decision Tree
Random Forest
Logistic Regression
Data Modeling
Experimental Evaluation
Data Mining Experiments—Classification Tests
F1 Score is the harmonic mean of Precision and Recall
SQL and NoSQL Database Experiments
Conclusions and Future Work
