Spatio-temporal modelling of dengue fever cases in Saudi Arabia using socio-economic, climatic and environmental factors

Ali Siddiq, Nagesh Shukla, Biswajeet Pradhan

doi:10.1080/10106049.2022.2072005

Abstract

Dengue Fever (DF) is a common vector-borne disease with catastrophic health implications. DF prediction modelling is a challenging task, although technologies such as Geographical Information Systems (GIS) and spatial statistics have improved our understanding of dengue dynamics. In this paper, we create a robust data analysis model to (i) provide a better understanding of confirmed dengue fever cases despite missing data, (ii) obtain better insights into risk factors associated with confirmed cases, and (iii) use machine learning to create clusters of patients with comparable characteristics. The last was accomplished with a self-organizing feature map (SOFM) and density-based spatial clustering of applications with noise (DBSCAN). The following approaches were used to classify confirmed cases: Decision Tree, k-nearest neighbours, Random Forest, AdaBoost, Support Vector Classification (SVC), CatBoost, and Naive Bayes. The CatBoost classifier achieved the best accuracy for the analysis of confirmed cases. Spatial analysis was conducted using the ordinary least squares (OLS) and geographically weighted regression (GWR) models to identify high-risk areas. The SOFM groups patients with similar features, and DBSCAN then detects and retrieves six clusters from these data. Clustering the confirmed cases increases CatBoost’s prediction accuracy and reveals complex factors that influence that accuracy. Because confirmed cases in each cluster have different features, CatBoost is applied to each cluster individually to improve the prediction accuracy. Variable values in each cluster are analysed to characterise the confirmed cases of a specific subset of DF incidents. Overall, OLS outperforms GWR when identifying hotspot areas. The proposed novel, data-driven and machine-learning-based strategy facilitates the understanding and identification of patterns associated with confirmed DF cases.
The study's findings can be utilized to cluster historical patient data into groups or subgroups sharing similar variables. Using identifiable patient clusters rather than raw historical data improves the accuracy of the CatBoost model.
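
The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the SOFM stage is omitted, DBSCAN is written out in plain Python for readability, and the toy points, the `eps` and `min_pts` values, and the helper names (`region_query`, `dbscan`, `split_by_cluster`) are all assumptions made for illustration. The sketch only shows how density-based labels can partition records so that a separate classifier (CatBoost in the paper) could then be fitted to each group.

```python
import math
from collections import defaultdict

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point
    return labels


def split_by_cluster(records, labels):
    """Group records by cluster label, dropping noise points."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        if label != NOISE:
            groups[label].append(record)
    return dict(groups)


# Toy 2-D feature vectors (hypothetical, not data from the study):
# two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
groups = split_by_cluster(points, labels)
# A separate classifier would now be trained on each entry of `groups`.
```

In practice a library implementation of DBSCAN and a gradient-boosting classifier would replace the hand-written parts; the point of the sketch is only the partition-then-model structure.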

Full Text