Abstract

Social media is a rich source of data for analysis, since it gives people a way to share emotions, feelings, ideas, and even symptoms of diseases. By the end of 2019, a global pandemic alert was raised for a virus that had a high contamination rate and could cause respiratory complications. To help identify people who may have symptoms of this disease or who are already infected, this paper analyzes the performance of eight machine learning algorithms (KNN, Naive Bayes, Decision Tree, Random Forest, SVM, simple Multilayer Perceptron, Convolutional Neural Network, and BERT) in the search and classification of tweets that contain self-reports of COVID-19 symptoms. The dataset was labeled using a set of disease symptom keywords provided by the World Health Organization. The tests showed that the Random Forest algorithm had the best results, closely followed by BERT and the Convolutional Neural Network, although traditional machine learning algorithms can also provide good results. This work could also aid in selecting algorithms for identifying disease symptoms in social media content.
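The keyword-based labeling mentioned in the abstract can be pictured with a minimal sketch. The keyword list, the function name `label_tweet`, and the example tweets below are hypothetical and chosen only for illustration; they are not the WHO keyword set or the labeling procedure used in the paper. Note also that a plain keyword match only detects a symptom mention, which is weaker than detecting a self-report.

```python
import re

# Hypothetical keywords for illustration only; NOT the WHO list used in the paper.
SYMPTOM_KEYWORDS = ["fever", "cough", "shortness of breath", "loss of taste"]

# One case-insensitive pattern that matches any keyword as a whole word or phrase.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SYMPTOM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def label_tweet(text):
    """Return 1 if the tweet mentions any symptom keyword, else 0."""
    return 1 if _PATTERN.search(text) else 0

# Hypothetical example tweets.
tweets = [
    "I have a fever and a dry COUGH since yesterday",
    "Stay home, stay safe",
]
labels = [label_tweet(t) for t in tweets]
```

Whole-word matching is a deliberate simplification here: inflected forms such as "coughing" are not matched, so a real pipeline would likely need stemming or a broader pattern.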

Highlights

  • True Positives (TP) are instances that the model classifies as positive and whose real class is positive; False Positives (FP) are instances classified as positive whose real class is negative; True Negatives (TN) are instances classified as negative whose real class is negative; False Negatives (FN) are instances classified as negative whose real class is positive

  • This model presented a high number of hits in the TN classification compared to the other models; Bidirectional Encoder Representations from Transformers (BERT) presented the highest number of correct classifications for the negative examples

  • K-Nearest Neighbors (KNN) and Naive Bayes had the worst results in FP and FN, respectively; that is, they were the weakest models at classifying the instances correctly
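The four outcomes defined in the highlights above can be counted directly from a list of real classes and a list of predicted classes. The following is a minimal sketch, not the evaluation code used in the paper; the function name `confusion_counts` and the example labels are hypothetical, with 1 standing for a tweet that self-reports symptoms and 0 for one that does not.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: correct (TN) or a missed positive (FN).
            counts["TN" if actual != positive else "FN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}

# Hypothetical labels: 1 = self-reported symptoms, 0 = no self-report.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
result = confusion_counts(y_true, y_pred)
print(result)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```

Comparing models on FP and FN separately, as the highlights do, shows whether a classifier tends to raise false alarms or to miss real self-reports.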

Summary

Introduction

New information about the behavior of the virus and the mapping of its transmissibility has been obtained. These data emerged from several sources, such as population health surveys, group studies, or mathematical models, in which causes and the relationships between risk factors and health consequences are investigated [2,3]. Since their advent, social networks have been widely used as a way for people to express emotions, feelings, opinions, and information, as well as health concerns and symptoms, making these communication media potential sources for collecting and building a database of self-reported symptoms [4,5].

