Machine Learning-based Flu Forecasting Study Using the Official Data from the Centers for Disease Control and Prevention and Twitter Data

Ali Wahid, Samuel Sambasivam, Steven Munkeby

doi:10.28945/4796

Open Access

Abstract

Aim/Purpose: In the United States, the Centers for Disease Control and Prevention (CDC) tracks disease activity using data collected weekly from medical practices. This weekly collection leads to a lag of approximately 2 weeks before any viable action can be planned. The study addressed the 2-week delay by creating machine learning models to predict flu outbreaks.

Background: The study addressed the 2-week delay by correlating flu trends identified from Twitter data with official flu data from the CDC, and by creating a machine learning model that uses both data sources to predict flu outbreaks.

Methodology: A quantitative correlational study was performed using a quasi-experimental design. The study used flu trends from the CDC portal and tweets from the state of Georgia mentioning flu and influenza, covering a period of 22 weeks from December 29, 2019 to May 30, 2020.

Contribution: This research contributed to the body of knowledge by using a simple bag-of-words method for sentiment analysis, followed by the combination of CDC and Twitter data, to generate a flu prediction model with higher accuracy than a model using CDC data only.

Findings: The study found that (a) there is no correlation between official flu data from the CDC and tweets mentioning flu, and (b) a flu forecasting model based on a machine learning algorithm performs better when it uses both official flu data from the CDC and tweets mentioning flu.

Recommendations for Practitioners: The study found no correlation between the official flu data from the CDC and the count of tweets mentioning flu, which is why tweets alone should be used with caution to predict a flu outbreak. Based on the findings of this study, social media data can be used as an additional variable to improve the accuracy of flu prediction models. The study also found that fourth-order polynomial and support vector regression models offered the best accuracy among the flu prediction models.

Recommendations for Researchers: Open-source data, such as Twitter feeds, can be mined for useful intelligence benefiting society. Machine learning-based prediction models can be improved by adding open-source data to the primary data set.

Impact on Society: A key implication of this study for practitioners in the field is to use social media postings to identify neighborhoods and geographic locations affected by seasonal outbreaks, such as influenza, which would help reduce the spread of the disease and ultimately lead to containment. Based on the findings of this study, social media data will help health authorities detect seasonal outbreaks earlier than by using only the official CDC channels of disease and illness reporting from physicians and labs, thus empowering health officials to plan their responses swiftly and allocate their resources optimally for the most affected areas.

Future Research: A future researcher could use more complex deep learning algorithms, such as Artificial Neural Networks and Recurrent Neural Networks, to evaluate the accuracy of flu outbreak prediction models as compared to the regression models used in this study. A future researcher could apply other sentiment analysis techniques, such as natural language processing and deep learning techniques, to identify context-sensitive emotion, concept extraction, and sarcasm detection for the identification of self-reporting flu tweets. A future researcher could expand the scope by continuously collecting tweets on a public cloud and applying big data applications, such as Hadoop and MapReduce, to perform predictions using several months or even years of historical data for a larger geographical area.
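The two analyses summarized in the abstract (a correlation between CDC flu data and tweet counts, and a fourth-order polynomial model fitted with and without the tweet count as an extra variable) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the weekly values are synthetic placeholders rather than the study's data, the week index is assumed to be the polynomial's input, and the helper names `design_matrix` and `fit_rmse` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def design_matrix(weeks, degree=4, extra=None):
    """Build columns 1, w, w^2, ..., w^degree, plus one optional extra feature."""
    # Standardize the week index so the higher powers stay well conditioned.
    w = (weeks - weeks.mean()) / weeks.std()
    cols = [w ** k for k in range(degree + 1)]
    if extra is not None:
        cols.append((extra - extra.mean()) / extra.std())
    return np.column_stack(cols)


def fit_rmse(X, y):
    """Ordinary least-squares fit; returns the in-sample root-mean-square error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


rng = np.random.default_rng(0)
weeks = np.arange(1, 23, dtype=float)  # 22 weekly observations

# SYNTHETIC placeholder series (assumption): not the CDC or Twitter data.
flu = 5.0 * np.exp(-((weeks - 6.0) / 4.0) ** 2) + rng.normal(0.0, 0.2, weeks.size)
tweets = rng.poisson(40, weeks.size).astype(float)

# Pearson correlation between the flu series and the weekly tweet count.
r = float(np.corrcoef(flu, tweets)[0, 1])

# Fourth-order polynomial in the week index, without and with the tweet count.
rmse_cdc = fit_rmse(design_matrix(weeks), flu)
rmse_both = fit_rmse(design_matrix(weeks, extra=tweets), flu)

print(f"Pearson r (flu vs. tweets): {r:.3f}")
print(f"RMSE, week polynomial only: {rmse_cdc:.4f}")
print(f"RMSE, polynomial + tweets:  {rmse_both:.4f}")
```

With least squares, adding a column can never increase the in-sample error, so the second RMSE is always at most the first, even when the extra variable is pure noise. A meaningful comparison of the two models therefore needs error measured on held-out weeks.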

Full Text