Abstract

Dengue fever (DF) is a common vector-borne disease that can have catastrophic health consequences. Predictive modelling of DF remains challenging, although technologies such as geographical information systems (GIS) and spatial statistics have improved our understanding of dengue dynamics. In this paper, we create a robust data analysis model to (i) provide a better understanding of confirmed dengue fever cases despite missing data, (ii) obtain better insights into the risk factors associated with confirmed cases, and (iii) use machine learning to cluster patients with comparable characteristics. The clustering was accomplished with a self-organizing feature map (SOFM) and density-based spatial clustering of applications with noise (DBSCAN). Confirmed cases were classified with seven methods: decision tree, k-nearest neighbours, random forest, AdaBoost, support vector classification (SVC), CatBoost, and naive Bayes. Of these, the CatBoost classifier achieved the best accuracy. Spatial analysis was conducted using ordinary least squares (OLS) and geographically weighted regression (GWR) models to identify high-risk areas. The SOFM groups patients with similar features, after which DBSCAN identifies six clusters in the data. Clustering the confirmed cases increases CatBoost’s prediction accuracy and reveals complex factors that influence it. Because the confirmed cases in each cluster have different features, CatBoost is applied to each cluster separately. The variable values in each cluster are analysed to characterise the confirmed cases in that subset of DF incidents. Overall, OLS outperforms GWR in identifying hotspot areas. The proposed novel, data-driven, machine-learning-based strategy facilitates the understanding and identification of patterns associated with confirmed DF cases.
The study's findings can be used to cluster historical patient data into groups or subgroups that share similar variables. Using identifiable patient clusters rather than raw historical data improves the accuracy of the CatBoost model.
