Abstract

Accurate infectious disease forecasting can inform efforts to prevent outbreaks and mitigate adverse impacts. This study compares the performance of statistical, machine learning (ML), and deep learning (DL) approaches in forecasting infectious disease incidence across different countries and time intervals. We forecasted three diverse diseases (campylobacteriosis, typhoid, and Q-fever) using a wide variety of features (n = 46) from public datasets, e.g., landscape, climate, and socioeconomic factors. We compared autoregressive statistical models to two tree-based ML models (extreme gradient boosted trees [XGB] and random forest [RF]) and two DL models (multi-layer perceptron and encoder–decoder model). The disease models were trained on region-level data from seven different countries for 2009–2017. Forecasting performance of all models was assessed using mean absolute error, root mean square error, and Poisson deviance across Australia, Israel, and the United States for January through August of 2018. The overall model results were compared across diseases as well as across various data splits, including country, regions with the highest and lowest case counts, and the number of months forecasted ahead (i.e., nowcasting, short-term, and long-term forecasting). Overall, the XGB models performed best for all diseases and, in general, tree-based ML models performed best across the data splits. In a few instances, the statistical or DL models had marginally smaller error metrics for specific subsets of typhoid, a disease with very low case counts. Feature importance per disease was measured using four tree-based ML models (i.e., XGB and RF with and without region name as a feature). The most important feature groups included previous case counts, region name, population counts and density, causes of mortality from neonatal to under 5 years of age, sanitation factors, and elevation. This study demonstrates the power of ML approaches to incorporate a wide range of factors to forecast various diseases, regardless of location, more accurately than traditional statistical approaches.
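
The three error metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the study, so the definitions below are the standard ones and are an assumption here; in particular, Poisson deviance is taken as the mean unit deviance 2(y log(y/mu) - y + mu). The function names and the case counts are hypothetical and serve only as an illustration.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and forecasted counts."""
    return sum(abs(y - mu) for y, mu in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the average squared forecast error."""
    squared = sum((y - mu) ** 2 for y, mu in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mean_poisson_deviance(observed, predicted):
    """Mean Poisson unit deviance (assumed standard definition).

    Each forecast mu must be strictly positive; the term y * log(y / mu)
    is taken as 0 when the observed count y is 0.
    """
    total = 0.0
    for y, mu in zip(observed, predicted):
        if mu <= 0:
            raise ValueError("Poisson deviance requires positive forecasts")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - y + mu)
    return total / len(observed)


# Hypothetical monthly case counts for one region (not data from the study).
observed = [0, 3, 5, 2]
predicted = [0.5, 2.0, 5.0, 4.0]

mae = mean_absolute_error(observed, predicted)
rmse = root_mean_square_error(observed, predicted)
deviance = mean_poisson_deviance(observed, predicted)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} Poisson deviance={deviance:.3f}")
```

Unlike the two absolute-error metrics, Poisson deviance weights an error relative to the size of the forecast, which is one reason it can be informative for diseases with very low case counts such as typhoid.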

Highlights

  • We demonstrate the utility of a traditional statistical time series model (GLARMA), machine learning (ML) regression trees (RF and XGB), and deep learning (DL) models (MLP and Enc–Dec) in forecasting diseases, thereby providing insights into how the models perform given different disease lifecycles and incidence across different geographic landscapes and climates as well as region-specific population demography and socioeconomic factors

  • Overall, the top-performing models for forecasting campylobacteriosis, Q-fever, and typhoid vary by the way the resulting metrics are split

  • More than 90% of all model performance based on these features can be attributed to previous case counts, country/region for non-Alt models, population counts, population density, mortality from neonatal to under 5 years of age, and sanitation, with elevation included for Alt models

Introduction

The threat posed by infectious diseases varies widely in terms of mortality, morbidity, and social and economic disruption. These threats are further magnified by anthropogenic and ecological factors, such as rapidly increasing population, globalization, urbanization, climate change, administrative conflicts, and weak health systems. The wide variety of potential causes for such disease occurrences has made preparedness and timely response a challenge. For these reasons, and despite significant improvements in control and prevention efforts over the past few decades, infectious diseases continue to pose a major challenge, causing millions of deaths each year worldwide, especially in low-income countries [2].
