Understanding the Area of Applicability of Data Driven Mosquito Abundance Prediction Models

Theoktisti Makridou, Konstantinos Tsaprailis, George Arvanitakis, Charalampos Kontoes, Diletta Fornasiero

doi:10.5194/egusphere-egu23-15398

Abstract

An Early Warning System for mosquito abundance is a valuable tool that can alert authorities to potential outbreaks of mosquito populations in a given area in the upcoming period. This information is used to take mitigation actions that limit the spread of vector-borne diseases such as West Nile virus, malaria and Zika. A promising direction for such systems is to predict the upcoming mosquito population with a data-driven approach based on machine learning (ML) algorithms. The ML algorithms are trained on a limited set of point-level data that includes environmental, geomorphological and climatic information, together with historical in-situ measurements of the mosquito population at specific latitude and longitude coordinates. The goal of the ML algorithms is to learn the patterns that connect the characteristics (features) of a given area (temperature, humidity, NDVI, rainfall, latitude, longitude, etc.) with the upcoming mosquito population. Since in-situ entomological data are limited and expensive to collect, one of the key challenges of this approach is to understand where those models generalize with acceptable accuracy, so that they can be reused in areas where no prior entomological information exists; in other words, to understand the area of applicability of those models. In this study we analyze the performance of ML algorithms that have been trained in specific areas and applied to “unseen” areas. Our analysis aims to understand the characteristics of the cases where the algorithms manage to generalize, compared with those where the performance is poor.
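The train-on-some-areas, test-on-an-unseen-area evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the function name `leave_one_area_out`, the integer area labels and the synthetic data are hypothetical, and an ordinary least squares fit stands in for the ML model so that the example needs only NumPy.

```python
import numpy as np

def leave_one_area_out(X, y, area_ids):
    """Hold out each area in turn and return {area: test MSE}.

    X        : (n_samples, n_features) feature matrix
    y        : (n_samples,) target, e.g. mosquito abundance
    area_ids : (n_samples,) integer label of the area of each sample (assumed)
    """
    scores = {}
    for area in np.unique(area_ids):
        test = area_ids == area
        train = ~test
        # Stand-in regressor (assumption): least squares with an intercept.
        A_train = np.column_stack([X[train], np.ones(train.sum())])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([X[test], np.ones(test.sum())])
        pred = A_test @ coef
        # Mean squared error on the area the model never saw.
        scores[int(area)] = float(np.mean((pred - y[test]) ** 2))
    return scores

# Synthetic usage example: three areas, 20 samples each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0
areas_demo = np.repeat([0, 1, 2], 20)
print(leave_one_area_out(X_demo, y_demo, areas_demo))
```

Comparing the per-area errors shows which held-out areas the model generalizes to and which it does not.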
Our aim is to establish a systematic approach for determining the area of applicability of the models, and thus to know in advance in which areas we expect the models to generalize properly and in which areas their predictions are not trustworthy. Our work relied on historical data on Culex pipiens mosquitoes (a vector of West Nile virus) collected in the Veneto region of Italy over the period 2011-2021, and on satellite Earth Observation data. As the ML regressor we used a feedforward neural network with a standard mean squared error cost function. We first conclude that the ordinary Euclidean distance between the coordinates of the training area and the unseen data is not an informative metric of the model’s area of applicability. Instead, we propose a metric that calculates the distance between the known and the unknown points in the feature space (environmental, geomorphological, etc.) and also takes into account the feature importance of the trained neural network, obtained from SHAP values. The results showed that the proposed metric is informative about where the model is expected to make more accurate predictions, and that it captures the cases where generalization will be poor. This information is useful both for judging whether the predictions of a model are trustworthy and for identifying the areas for which our prior information is insufficient, so that action can be taken in future network planning.
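The abstract does not give the formula of the proposed metric, so the following is only one plausible sketch of a feature-space distance weighted by SHAP-based feature importance. The assumptions are: importance is the mean absolute SHAP value per feature, features are standardized with training statistics, and each unseen point is scored by its weighted Euclidean distance to the nearest training point. The function name `shap_weighted_distance` is hypothetical, and the SHAP values are taken as a precomputed array rather than computed here.

```python
import numpy as np

def shap_weighted_distance(X_train, X_new, shap_values):
    """Importance-weighted distance from each new point to its nearest training point.

    X_train     : (n_train, n_features) features of the training points
    X_new       : (n_new, n_features) features of the unseen points
    shap_values : (n_train, n_features) precomputed SHAP values (assumed input)
    """
    # Feature importance (assumption): mean |SHAP| per feature, normalized to sum to 1.
    importance = np.abs(shap_values).mean(axis=0)
    weights = importance / importance.sum()

    # Standardize with training statistics so that units do not dominate.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant features
    Z_train = (X_train - mu) / sigma
    Z_new = (X_new - mu) / sigma

    # Pairwise weighted Euclidean distances, shape (n_new, n_train).
    diff = Z_new[:, None, :] - Z_train[None, :, :]
    dist = np.sqrt((weights * diff ** 2).sum(axis=-1))

    # Distance to the closest training point for each unseen point.
    return dist.min(axis=1)
```

Under this construction, an unseen point that differs from the training data only in features the model ignores gets a small distance, while a point that differs in an important feature gets a large one, whatever its geographic distance.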
